How memory architectures affect system performance
Victor Echevarria
(01/31/2005 8:44 PM EST)
Since the mid-1990s, memory technologies have mostly been named according to how fast they run. A PC100 SDRAM device would operate at a 100MHz data rate, PC133 at a 133MHz data rate, and so on. While variations on this branding convention have evolved over time, most give a potential buyer an idea of how fast a memory device will operate.
Put simply, most memory technologies, and all of today's mainstream ones, are branded by their peak data rate, which has been and remains one of the most important factors in calculating the performance of a memory system. However, no memory device in a real system operates at its peak data rate 100 percent of the time.
Switching from writing to reading, accessing certain addresses at certain times, and refreshing data all require some amount of inactivity on the data bus, preventing full utilization of your memory channel. Additionally, wide parallel busses and DRAM core prefetch both often lead to unnecessarily large data accesses.
The amount of usable data that a memory controller can access in a given period of time is called the effective data rate, and is highly dependent on the specific application of the system. Effective data rates vary with time and are often much lower than peak data rates. In some systems, effective data rates can drop to less than 10 percent of the peak data rate.
Often, these systems could benefit from a change in memory technology that would yield a higher effective data rate. A similar phenomenon exists in the product space of the CPU, where in recent years companies like AMD and Transmeta have shown that clock frequency is not the only important factor when measuring the performance of a CPU-based system.
Memory technologies have also matured to the point where peak and effective data rates might not match as well as they once did. While peak data rate remains one of the most important parameters of a memory technology, other architectural parameters can drastically affect the performance of a memory system.
Parameters that impact effective data rate
There are several classes of parameters that impact effective data rates, one of which introduces periods of inactivity on the data bus. Of the parameters in this class, bus turnaround, row-cycle time (tRC), CAS delay, and RAS-to-CAS delay (tRCD) cause the most latency headaches for system architects.
Bus turnaround by itself can create very long periods of inactivity on the data channel. Take, for example, a GDDR3 system that constantly reads data from the open pages of a memory device. During this period, the memory system's effective data rate matches its peak data rate.
However, assume now that over the course of 100 clock cycles, the memory controller switches from reading to writing. Since the penalty for this switch is six cycles, the effective data rate drops to 94 percent of the peak data rate. If in this 100-cycle period, the memory controller switches the bus back from writes to reads, even more cycles are lost.
This memory technology requires 15 idle cycles while switching from writes to reads, further dropping the effective data rate to 79 percent of the peak data rate. Table 1 shows the results of this same calculation for several high-performance memory technologies.
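The turnaround arithmetic above can be sketched in a few lines. This is only an illustration of the calculation in the text: the function name and constants are hypothetical, and the 6-cycle and 15-cycle penalties are the GDDR3 figures quoted above, not values taken from a datasheet.

```python
def effective_rate_fraction(window_cycles, idle_cycles):
    """Return the fraction of peak data rate left after idle bus cycles.

    window_cycles: length of the observation window, in clock cycles.
    idle_cycles:   iterable of idle-cycle penalties incurred in that window.
    """
    if window_cycles <= 0:
        raise ValueError("window_cycles must be positive")
    lost = sum(idle_cycles)
    if lost > window_cycles:
        raise ValueError("idle cycles exceed the observation window")
    return (window_cycles - lost) / window_cycles


# Penalties quoted in the text for the GDDR3 example (illustrative only).
READ_TO_WRITE = 6
WRITE_TO_READ = 15

one_turn = effective_rate_fraction(100, [READ_TO_WRITE])
two_turns = effective_rate_fraction(100, [READ_TO_WRITE, WRITE_TO_READ])

print(f"{one_turn:.0%}")   # 94%
print(f"{two_turns:.0%}")  # 79%
```

The same function can be reused for other technologies by substituting their turnaround penalties.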
Table 1 — Effective and peak data rates for two bus turn-arounds every 100 cycles
Clearly, not all memory technologies are created equal. System designers requiring many bus turnarounds would benefit from choosing a more efficient technology like XDR, RDRAM, or DDR2. If, on the other hand, a system can group transactions into very long sequences of writes and reads, bus turnarounds will have a minimal effect on effective bandwidth. However, other latency-adding phenomena, such as bank conflicts, can still reduce its effective bandwidth.
DRAM technologies all require that pages, or rows, of a bank be opened before they are accessed. Once open, a different page in the same bank cannot be opened until a minimum period of time known as the row-cycle time (tRC) has elapsed. A memory access to a different page of an open bank is referred to as a page miss and can incur a latency penalty associated with any unsatisfied portion of the tRC interval.
Page misses to banks that have not yet been open for enough cycles to satisfy the tRC interval are called bank conflicts. While tRC determines the magnitude of the bank conflict latency, the number of banks available in a given DRAM will directly influence the frequency at which bank conflicts occur.
Most memory technologies have either four or eight banks and have tRC values in the tens of cycles. Under random workloads, those with eight-bank cores will have fewer bank conflicts than those with four-bank cores. While the interaction between tRC and bank-count is complex, their cumulative impact can be quantified in a number of ways.
Memory read transactions
Consider three simple cases of memory read transactions. In the first case, a memory controller issues every transaction such that it creates a bank conflict with the prior transaction. The controller must wait a time tRC between opening a page and opening the subsequent page, thus adding the maximum amount of latency associated with cycling pages. The effective data rates in this case are largely independent of the I/O and limited mainly by the DRAM core circuitry. Maximum bank conflict rates reduce effective bandwidths to between 20 percent and 30 percent of peak in today's highest-end memory technologies.
In the second case, each transaction targets a randomly generated address. Here, the chance of encountering a bank conflict depends on many factors, including the interaction of tRC and the number of banks in the memory core. The smaller the value of tRC, the sooner an open page can be cycled, leading to lower bank conflict penalties. In addition, the more banks a memory technology has, the lower the chance of a bank conflict for random address accesses.
In the third case, every transaction is a page hit, addressing various column addresses in the open pages. The controller need never access a closed page, allowing 100 percent bus utilization, resulting in an ideal case where effective data rate equals peak data rate.
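The three cases can be compared with a toy simulation. This is a deliberately simplified sketch, not a model of any real DRAM: it tracks only tRC and a fixed data-transfer time per access, and ignores tRCD, precharge, refresh, and command-bus limits. The values `T_RC = 20` and `BURST = 4` cycles, the row count, and all names are assumptions chosen for illustration.

```python
import random


def bus_utilization(accesses, t_rc, burst_cycles):
    """Fraction of cycles the data bus is busy for a list of (bank, row) accesses.

    A page miss in a bank must wait until t_rc cycles have passed since that
    bank's previous row activation; a page hit proceeds immediately.
    """
    last_activate = {}  # bank -> cycle at which its current row was opened
    open_row = {}       # bank -> currently open row
    now = 0
    busy = 0
    for bank, row in accesses:
        if open_row.get(bank) != row:  # page miss: a new row must be opened
            if bank in last_activate:
                # Stall for any unsatisfied part of the tRC interval.
                now = max(now, last_activate[bank] + t_rc)
            last_activate[bank] = now
            open_row[bank] = row
        now += burst_cycles
        busy += burst_cycles
    return busy / now if now else 0.0


T_RC = 20    # assumed row-cycle time, in cycles
BURST = 4    # assumed data-bus cycles per access
N = 10_000   # accesses per experiment


def random_accesses(num_banks, rng, num_rows=4096):
    """Uniformly random (bank, row) pairs."""
    return [(rng.randrange(num_banks), rng.randrange(num_rows)) for _ in range(N)]


rng = random.Random(0)  # fixed seed so runs are repeatable
worst = bus_utilization([(0, i % 2) for i in range(N)], T_RC, BURST)  # case 1
rand4 = bus_utilization(random_accesses(4, rng), T_RC, BURST)         # case 2
rand8 = bus_utilization(random_accesses(8, rng), T_RC, BURST)         # case 2
best = bus_utilization([(0, 0)] * N, T_RC, BURST)                     # case 3

print(f"all bank conflicts: {worst:.2f}")  # 0.20
print(f"random, 4 banks:    {rand4:.2f}")
print(f"random, 8 banks:    {rand8:.2f}")
print(f"all page hits:      {best:.2f}")   # 1.00
```

With these assumed timings the all-conflict case settles at roughly `BURST / T_RC`, and the random case lands between the two extremes, with the eight-bank configuration stalling less often than the four-bank one.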
While the first and third cases involve fairly straightforward calculations, the random case is affected by other features not specifically included in the DRAM or the memory interface. Memory controller arbitration and queuing can drastically reduce the bank conflict rate, since the controller can issue other, possibly non-conflicting transactions in place of those that would cause bank conflicts.
However, adding memory queue depth does not necessarily close the gap in effective data rate between two different memory technologies. For example, XDR maintains a 20 percent higher effective data rate than GDDR3 even with added memory controller queue depth. This delta exists primarily because of XDR's higher bank count and lower value of tRC. In general, shorter tRC intervals, higher bank counts, and larger controller queues lead to higher effective bandwidths.
Many efficiency-limiting phenomena are actually problems associated with row-access granularity. tRC constraints essentially require that memory controllers access a certain amount of data from newly opened rows to ensure that the data pipeline is kept full. In essence, to keep the data bus operating without interruption, there is a minimum amount of data that must be read after opening a row, even if the extra data is not needed.
The other major class of features that reduce the effective bandwidth of a memory system falls under the category of column-access granularity, which dictates how much data each individual read and write operation must transfer. In contrast, row-access granularity dictates how many individual read and write operations are required per row activation (commonly referred to as CAS operations per RAS).
Column-access granularity can also have a large, albeit less quantifiable effect on effective data rates. Since it dictates the minimum amount of data transferred in a single read or write, column-access granularity poses a problem for systems that typically only require small amounts of data at a time. For example, a system with 16-byte access granularity that requires eight bytes each from two columns must read a total of 32 bytes to access both locations.
Since only 16 of the 32 bytes were needed, the system experienced a diminished effective data rate equal to 50 percent of its peak data rate. Two architectural parameters dictate the access granularity of a memory system: Bus width and burst length.
Bus width refers to the total number of data traces connected between the memory controller and the memory devices. It sets a minimum access granularity since each data trace must carry at least one bit of data for a given memory transaction. Burst length, in turn, specifies the number of bits each lane must carry for a given transaction. A memory technology that transmits one bit of data per data trace per transaction is said to have a burst length of one. Total column-access granularity is simply:
- Column-access granularity = Bus Width x Burst Length
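As a rough sketch, the formula and the 16-byte example given earlier can be worked through as follows. The function names are hypothetical, and the 128-bit bus with a burst length of one is just one assumed configuration that yields 16-byte granularity. The sketch also assumes each request targets its own column and never straddles an access boundary.

```python
def column_access_granularity_bits(bus_width_bits, burst_length):
    """Column-access granularity = bus width x burst length, in bits."""
    return bus_width_bits * burst_length


def transfer_efficiency(requests_bytes, granularity_bytes):
    """Useful bytes divided by bytes actually transferred.

    Each request is rounded up to a whole number of column accesses.
    """
    if granularity_bytes <= 0:
        raise ValueError("granularity_bytes must be positive")
    if not requests_bytes:
        raise ValueError("at least one request is required")
    moved = sum(
        -(-size // granularity_bytes) * granularity_bytes  # ceiling division
        for size in requests_bytes
    )
    return sum(requests_bytes) / moved


# Assumed configuration: 128 data traces, burst length of one -> 16 bytes.
granularity_bytes = column_access_granularity_bits(128, 1) // 8

# Eight bytes needed from each of two columns, as in the example above.
efficiency = transfer_efficiency([8, 8], granularity_bytes)

print(granularity_bytes)    # 16
print(f"{efficiency:.0%}")  # 50%
```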
Many system architects increase the bandwidth available from the memory system by merely adding more DRAM devices and increasing the width of the memory bus. After all, if four links at a 400MHz data rate give you 1.6Gbps of aggregate peak bandwidth, eight links will give you 3.2Gbps. Adding a DRAM device, a few more traces on the board, and the corresponding pins on the ASIC doubles the total aggregate peak bandwidth.
Table 2 shows the total aggregate peak bandwidth achievable with different memory technologies and bus widths as well as the total controller pin count required for each configuration.
Table 2 — Total peak bandwidth for various memory technologies and bus widths with required controller pincount
This gain, however, comes at a price. First and foremost, architects concerned primarily with squeezing out every ounce of peak bandwidth have already reached a feasible maximum for how wide they can physically design their memory busses. It's not uncommon to find graphics controllers that have 256- or even 512-bit wide memory busses, which require 1,000 or more controller pins.
Package designers, ASIC floorplanners, and board designers cannot find the area to route this many signals using inexpensive, commercially viable means. The other problem with merely increasing the bus width to attain a higher peak data rate is that it reduces effective bandwidth because of column-access granularity limitations.
Assuming that the burst length of a particular memory technology can equal one, the access granularity of a 512-bit wide system is 512 bits (or 64 bytes) for a single memory transaction. If the controller only needs data in smaller chunks, the remaining data is wasted, reducing the effective data rate for the system.
For example, a controller that needs only 32 bytes of data from the memory system mentioned previously would waste the remaining 32 bytes, resulting in an effective data rate equal to 50 percent of the peak data rate. Remember that these calculations all assume a burst length of one. As memory interface data rates continue to rise, most new technologies have a minimum burst length greater than one.
Core prefetch
A feature called core prefetch is primarily responsible for the increase in minimum burst length. DRAM core circuitry cannot keep up with the drastic increase in speed of the I/O circuitry. Since data can no longer be fetched serially from the core to satisfy a controller request, the core usually gives the I/O a set of data much larger than the bus width of the DRAM.
Essentially, the core transfers enough data to and from the interface circuits to keep them busy long enough for the core to prepare for the next operation. For example, assume a DRAM core can only respond to an operation once every nanosecond. However, the interface can sustain data rates of two bits per nanosecond.
Instead of wasting half the capability of the interface, the DRAM core fetches two bits per operation instead of one. After the interface transfers the data, the core is ready to respond to the next request without any added delay. The added core prefetch results in an increased minimum burst length of two and will directly impact the column access granularity.
For every additional signal added to the bus width, the memory interface will transfer two additional bits of data. A 512-bit wide memory system with a minimum burst length of two would therefore have access granularity equal to 1,024 bits (128 bytes). Many systems are not sensitive to minimum access granularity issues since they access data in very large chunks. However, some systems rely on the memory system to provide small units of data and benefit from the use of narrower, more efficient memory technologies.
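The prefetch reasoning above can be sketched numerically. This is an illustrative calculation under the text's simplified assumptions (a core that responds once per nanosecond and an interface that moves two bits per nanosecond on each pin); the function name is hypothetical. The final step reuses the earlier 32-byte request to show what the larger granularity does to a small access.

```python
import math


def min_burst_length(interface_bits_per_ns, core_ops_per_ns):
    """Bits each pin must transfer per core operation to keep the interface busy."""
    if interface_bits_per_ns <= 0 or core_ops_per_ns <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(interface_bits_per_ns / core_ops_per_ns)


# Figures from the example in the text (per data pin).
burst = min_burst_length(interface_bits_per_ns=2, core_ops_per_ns=1)

# Column-access granularity = bus width x burst length.
BUS_WIDTH_BITS = 512
granularity_bytes = BUS_WIDTH_BITS * burst // 8

# A controller that needs only 32 bytes still transfers the full granule.
useful_fraction = 32 / granularity_bytes

print(burst)                     # 2
print(granularity_bytes)         # 128
print(f"{useful_fraction:.0%}")  # 25%
```

Under these assumptions, the 32-byte request that used half of each transfer at a burst length of one uses only a quarter once prefetch raises the minimum burst length to two.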
Table 3 — Access granularity and bus width values for a single memory channel of today's mainstream memory technologies
Conclusion
Effective data rates are becoming more important as memory technologies raise their peak data rates. When making memory decisions, designers must take a deeper look into published memory specifications and see how a particular technology's features will interact with the application at hand.
Memory system designers must look beyond the peak data rate specification, just as CPU designers are slowly phasing out the use of gigahertz as the only performance metric. While peak data rate still holds the title of most important specification when it comes to memory interfaces, effective data rates are starting to make headway with system designers and architects. The performance of tomorrow's products will depend greatly on efficient utilization of their memory systems.
Victor Echevarria is RDRAM Product Manager for the Memory Interface Division at Rambus Inc. He joined Rambus in 2002 as a Systems Engineer. Prior to joining Rambus, Victor interned with Agilent Technologies, where he developed software for their high-speed digital sampling oscilloscopes.