Tools for Test and Debug : Dealing with determinism in DSPs with pipelines and caches
By Giuseppi Olivadoti, Technical Marketing Specialist, DSP Tools Product Line, DSPS Division, Analog Devices, Norwood, Mass., EE Times
June 3, 2002 (10:16 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020531S0033
To increase the performance of time-critical code, new DSPs are adopting RISC architectural features such as caches and pipelines. While cache implementations can provide a simpler programming model, real-time DSP code requires more determinism than a cache-based scheme alone can provide, so close analysis is needed to resolve potential bottlenecks in the data flow.
In the past, system designers had to trade a simpler programming model for higher performance. Now, integrated development tools are reaching the market that can achieve both maximum performance and a simple programming model. Such tools include new utilities that allow developers to visually analyze the run-time characteristics of applications executing on DSPs that include a cache and pipeline, thereby improving determinism of code execution.
Determinism, a key feature of real-time systems, is the ability to determine with high probability how long a task takes to execute. Real-time systems require the completion of certain tasks before others can start.
Historically, DSPs have relied heavily on the fact that individual instruction execution time could be quantitatively expressed. New DSP architectures include caches and pipelines that can provide a simple programming model - and the determinism required for real-time, critical code.
Unfortunately, caches and pipelines can create stalls that interfere with an application's determinism. This is problematic, because a DSP that normally executes instructions rapidly must halt to wait for a stall condition to be remedied.
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. In a pipelined architecture, instructions execute back to back, without latency, as long as sufficient resources are available. The time required to execute any individual instruction does not diminish, but more of the processor's resources are exercised during each cycle, so more instructions complete in a given period. Problems occur when an instruction in the pipeline depends on the outcome of a conditional branch or on data produced by an earlier instruction.
If the assumption about which instruction follows a conditional branch turns out to be incorrect, a "bubble" is placed in the pipeline. This bubble is a wasted cycle. Further, all the instructions that entered the pipeline after the branch may be invalid.
The penalty for an incorrect assumption can equal the depth of the pipeline. DSPs running at over 300 MHz commonly have pipelines of at least eight stages, in which case the penalty for an incorrect assumption would be eight cycles. A data dependency stall, by contrast, occurs when data needed by an instruction in the pipeline is not yet available.
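The cost of incorrect branch assumptions can be estimated with simple arithmetic. The sketch below is an idealized model, not a description of any particular DSP: it assumes one instruction completes per cycle once the pipeline is full, and it charges the worst case from the text (one pipeline depth) for each incorrect assumption. The function name and the instruction counts are hypothetical.

```python
def pipeline_cycles(n_instructions, depth, n_mispredicts, penalty=None):
    """Idealized cycle count for a simple in-order pipeline.

    Assumption: one instruction completes per cycle once the pipeline
    is full, and every incorrect branch assumption costs `penalty`
    cycles (default: the full pipeline depth, the worst case).
    """
    if penalty is None:
        penalty = depth
    fill = depth - 1  # cycles spent filling the pipeline at start-up
    return fill + n_instructions + n_mispredicts * penalty


# Hypothetical workload: 1000 instructions on an eight-stage pipeline.
best = pipeline_cycles(1000, depth=8, n_mispredicts=0)
worst = pipeline_cycles(1000, depth=8, n_mispredicts=50)
print(best, worst)  # 1007 1407
```

In this made-up case, 50 incorrect assumptions add 400 cycles, roughly 40 percent more than the stall-free run, which is why the spread between the two figures matters for determinism.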
A number of recent debug and analysis tools allow developers to visualize the pipeline, which helps them understand how applications execute within it. Viewing the contents of the pipeline shows precisely where bubbles occur. Identifying these bubbles can help the developer minimize the code sequences that stall execution and disrupt determinism.
By viewing instructions in the pipeline, designers can get a glimpse of what is going on. Some instructions can propagate through the pipeline with no stalls, while others cause stalls; these stalls introduce bubbles that propagate through the pipeline before it can run optimally again. By identifying the bubbles in the pipeline the developer could either account for them, giving determinism, or attempt to eliminate them, giving higher performance.
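To make the idea of a pipeline view concrete, here is a minimal text-only sketch. It is not how any vendor tool works; it assumes a hypothetical four-stage in-order pipeline (F, D, E, W) and a program in which each instruction carries a known number of stall cycles. All names and the sample program are illustrative.

```python
STAGES = ["F", "D", "E", "W"]  # hypothetical fetch/decode/execute/write-back


def schedule(program):
    """Return (name, start_cycle) for each (name, stall_cycles) entry.

    Each stall delays the instruction, and everything behind it,
    by that many cycles (the bubbles).
    """
    rows = []
    cycle = 0
    for name, stall in program:
        cycle += stall          # bubbles inserted ahead of this instruction
        rows.append((name, cycle))
        cycle += 1              # next instruction can enter one cycle later
    return rows


def render(rows):
    """Draw one row per instruction; dots are cycles before it enters."""
    lines = []
    for name, start in rows:
        cells = ["."] * start + STAGES
        lines.append(f"{name:<6}" + " ".join(cells))
    return "\n".join(lines)


def total_cycles(rows):
    """Cycle count until the last instruction leaves the final stage."""
    return rows[-1][1] + len(STAGES)


# "add" waits two cycles for the value loaded by "load".
program = [("load", 0), ("add", 2), ("store", 0)]
rows = schedule(program)
print(render(rows))
print("cycles:", total_cycles(rows))  # cycles: 8
```

Without the two-cycle stall the same three instructions would finish in six cycles, so the developer can either budget for eight or reorder the code to remove the dependency.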
Cache is a hierarchical memory component based on the assumption that memory close to the core of the DSP runs faster than larger banks of memory farther down the hierarchy. The closer memory in a hierarchy is to the DSP, the faster and more expensive per bit it is. The level closest to the DSP supplies instructions and data with no memory latency. The goal of an ideal cache system is to give the impression that all system memory is fast, no-latency memory, yet in a real memory system, cache is generally only a small portion of the total addressable memory.
Cache relies on the theory of spatial and temporal locality, and makes assumptions about the instructions that should be present at any given time in order to maximize its no-latency-hit rate. Hit rate, which is a common measure of cache performance, refers to the percentage of total cache accesses that results in finding the right instruction or data element in the cache.
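Hit rate translates directly into an average memory access cost. The sketch below uses the standard textbook relation (average cost = hit cost + miss rate × miss penalty), which the article itself does not state; the cycle counts and access totals are hypothetical.

```python
def hit_rate(hits, accesses):
    """Fraction of cache accesses that found the item in the cache."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return hits / accesses


def average_access_cycles(rate, hit_cycles, miss_penalty_cycles):
    """Average cycles per access: hit cost plus the expected miss cost."""
    return hit_cycles + (1.0 - rate) * miss_penalty_cycles


# Hypothetical numbers: 950 hits in 1000 accesses, a 1-cycle hit,
# and a 20-cycle penalty for going to slower memory.
rate = hit_rate(950, 1000)
print(rate)                                 # 0.95
print(average_access_cycles(rate, 1, 20))   # roughly 2 cycles per access
```

Even a 95 percent hit rate doubles the average access cost in this example, which shows how sensitive timing is to the last few percent.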
In a cache system, if the right instruction or data element is not found, the access is called a "miss." There are three types of cache misses: compulsory, capacity, and conflict. A compulsory miss occurs on the first access to a block, before the cache has had any chance to hold it. A capacity miss occurs when not all of the blocks needed during the execution of a program can fit within the cache. A conflict miss occurs when multiple blocks compete for storage in the same set.
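The three miss types can be separated in a small simulation. The sketch below uses a common classification approach, which is an assumption here rather than something the article specifies: a miss is compulsory if the block has never been referenced, a conflict miss if a fully associative cache of the same size would have hit, and a capacity miss otherwise. It models a set-associative cache with LRU replacement and takes a trace of block numbers; the function name and traces are hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, n_sets, ways):
    """Count hits and miss types for a set-associative LRU cache."""
    capacity = n_sets * ways
    sets = [OrderedDict() for _ in range(n_sets)]  # the cache being studied
    full = OrderedDict()  # fully associative reference cache, same size
    seen = set()          # every block referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference cache first.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= capacity:
                full.popitem(last=False)  # evict least recently used
            full[block] = True

        # Now access the real cache; the set is chosen by block number.
        target = sets[block % n_sets]
        if block in target:
            target.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(target) >= ways:
                target.popitem(last=False)
            target[block] = True
            if block not in seen:
                counts["compulsory"] += 1
            elif full_hit:
                counts["conflict"] += 1   # only the set mapping caused it
            else:
                counts["capacity"] += 1   # too many blocks for any mapping
        seen.add(block)
    return counts


# Blocks 0 and 4 share set 0 of a direct-mapped, four-set cache.
print(classify_misses([0, 4, 0, 4], n_sets=4, ways=1))
# {'hit': 0, 'compulsory': 2, 'capacity': 0, 'conflict': 2}
```

In the sample trace the cache is mostly empty, yet half the accesses are conflict misses, the kind of pattern that is hard to spot without a tool that separates the miss types.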
Development tools that contain linear profiler and cache viewer windows help developers visualize in detail how an application's cache system is performing. For example, a cache viewer helps the user visualize the complex interaction between application code and cache performance, allowing the user to describe the system's performance deterministically. The user can then work toward not only deterministic behavior but also optimal performance.
Suppose, for example, that a cache viewer shows no capacity misses and mostly compulsory misses. A developer could conclude that sacrificing some cache capacity in order to have more information in the cache initially would be a beneficial trade-off. Certain DSPs have the ability to pre-fetch and to lock down a way in cache, both of which are methods for customizing the cache operation.
The ability to lock down a cache, which means that part of the cache is primed with the instructions and data it needs, can greatly reduce the number of compulsory cache misses without impacting the needed cache capacity. Because not all of the cache capacity is used in that example, portions of code can be locked down to lower the number of compulsory miss events.
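The effect of priming and locking part of a cache can be illustrated with a simplified model. Real DSPs typically lock by way and have device-specific mechanisms; the sketch below instead assumes a fully associative LRU cache in which a chosen set of blocks is preloaded and never evicted. The function name, block numbers, and capacity are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, capacity, locked=()):
    """Misses for a fully associative LRU cache with optional locked blocks.

    Assumption: locked blocks are preloaded before the trace runs and
    are never evicted; they reduce the space left for everything else.
    """
    locked = set(locked)
    if len(locked) > capacity:
        raise ValueError("cannot lock more blocks than the cache holds")
    free = capacity - len(locked)  # blocks available to unlocked data
    lru = OrderedDict()
    misses = 0
    for block in trace:
        if block in locked:
            continue               # primed and pinned: always a hit
        if block in lru:
            lru.move_to_end(block)
            continue
        misses += 1
        if free == 0:
            continue               # nowhere to place it; stays uncached
        if len(lru) >= free:
            lru.popitem(last=False)  # evict least recently used
        lru[block] = True
    return misses


# A hypothetical loop touching three blocks, in an eight-block cache.
trace = [10, 11, 12, 10, 11, 12]
print(count_misses(trace, capacity=8))                       # 3
print(count_misses(trace, capacity=8, locked={10, 11, 12}))  # 0
```

Because the working set is far smaller than the cache, locking costs nothing here and removes all three compulsory misses. With a larger working set, the same lock would shrink the usable capacity and could add capacity misses instead.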
If the system under analysis has misses, it is difficult to ensure that a task will complete in a deterministic fashion. If the hit rate were to approach 100 percent, the level of certainty that a task will execute in a deterministic length of time would be high, especially if the developer can examine the cache system and arrange the application to minimize cache misses. This capability makes all memory appear as though it resides in very fast, no-latency memory, without the cost.
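The link between misses and determinism can be expressed as a best-case and worst-case cycle window for a task. The sketch below is a rough model under stated assumptions: the task has a fixed core cycle count when every access hits, each miss adds a fixed penalty, and the developer can bound the number of misses (for example, from a cache viewer). All names and numbers are hypothetical.

```python
def task_cycle_bounds(core_cycles, accesses, max_misses, miss_penalty):
    """Return (best, worst) cycle counts for one run of a task.

    Best case assumes every access hits; worst case assumes the
    bounded number of misses all occur.
    """
    misses = min(max_misses, accesses)  # cannot miss more than we access
    return core_cycles, core_cycles + misses * miss_penalty


def meets_deadline(worst_cycles, clock_hz, deadline_s):
    """True if the worst-case run time fits inside the deadline."""
    return worst_cycles / clock_hz <= deadline_s


# Hypothetical task on a 300 MHz DSP with a 50-microsecond deadline.
best, worst = task_cycle_bounds(10_000, accesses=2_000,
                                max_misses=100, miss_penalty=20)
print(best, worst)                           # 10000 12000
print(meets_deadline(worst, 300e6, 50e-6))   # True
```

As the bound on misses shrinks toward zero, the worst case converges on the best case and the task's execution time becomes effectively fixed, which is the situation the paragraph above describes.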