Achieving Better Code Optimization in DSP Designs
Achieving Better Code Optimization in DSP Designs
Chris Chung and Xiangdong Fu, Texas Instruments
Jun 22, 2004 (5:00 AM)
URL: http://www.commsdesign.com/showArticle.jhtml?articleID=22100650
Digital signal processors (DSPs) have found widespread use for many computationally intensive signal processing applications in fields such as communications. As the communication applications become more sophisticated and complex, DSPs have continued to evolve in order to meet the ever increasing processing requirements. Many of today's DSP architectures employ a deep pipeline that supports high-frequency operation and a very long instruction word (VLIW) that enables the execution of multiple instructions in parallel. While these architectures have vastly increased DSP performance, they have also vastly increased programming complexity.1 The result of this complexity is that programmers increasingly rely on tools that automate the process of creating and optimizing code. While the first stage of development is usually in C, programmers often find that coding critical blocks directly in the assembly language makes software execution more efficient. But, whether the source is created in C, in assembly, or in a combination of the two, programmers need to understand the DSP architecture in order to optimize code. Knowing how the object code created by the compiler or assembler operates on the underlying hardware enables the programmer to write code that uses the tool more effectively. It also permits fine-tuning of the software in order to maximize the DSP's performance. This article will look at the coding tools that are available to designers working with DSPs. The article will also provide some techniques that designers can employ to optimize code. Coding Tools While the complexity of assembly coding increases, the vastly greater performance gives the program developer more latitude to trade off a little performance for ease of development. As a result, the C language, which is more familiar than DSP assembly to most programmers, has become the first choice for software development. C compilers have a high degree of efficiency in VLIW machines, so developers naturally turn to C when they are trying to create and test software quickly. One of the disadvantages of C compilers in taking full advantage of high-performance DSP architectures is that they cannot always efficiently map complex algorithms to the multiple VLIW functional units. For example, architecture-specific details such as single-instruction multiple-data (SIMD) instructions, which exploit inherent parallelism in many applications, cannot be represented in generic C. In these instances, a machine-specific approach is required in order to extend the utility of C for high-performance DSP architectures. VLIW compilers often support C intrinsics, function-like statements within C that represent assembly instructions. As an example, Code Lists 1 and 2 show generic C and intrinsic C codes, respectively, for a finite impulse response (FIR) algorithm implemented on a VLIW DSP. List 1. Generic C code for FIR filter List 2. C code with intrinsics for FIR filter (partial list). Even with C instrinsics, some important architectural details cannot be represented, which is why assembly coding is often required to achieve maximum attainable performance. Partitioning code for clusters is a good example. In most VLIW DSPs, the functional units are divided into groups called clusters, and each cluster has its own local register file. This clustered approach substantially reduces the hardware complexity due to the smaller number of ports required for each register file. In cluster-based architectures, one of the critical steps that the C compiler performs is to balance or partition instructions over multiple clusters. But the compiler is imperfect, and there is no way to explicitly give it partitioning hints. Therefore, it is often necessary to rewrite blocks of code in assembly in order to perform partitioning effectively. Code List 3 shows an example of assembly code with partitioning that corresponds to the FIR code in Lists 1 and 2. List 3: Assembly code for FIR filter (partial list). Code List 3 is manually written with an optimization technique called software pipelining to maximally utilize multiple functional units. Doing this by hand is tedious and opens more possibilities for errors. Linear assembler is one tool that can automatically perform software pipelining. This assembler allows a programmer to write assembly code serially, as if it were to be performed by a single functional unit. The assembler then turns the serial code into parallel code for multiple functional units. Code List 4 shows an example of assembly code developed using linear assembler. List 4: Linear assembly code for FIR filter (partial list). Table 1 compares the four code segments. Note that the cycle count becomes closer to the ideal cycle count as more architectural details can be specified. Table 1: Performance of FIR Filter Codes What typically happens in software development is that the vast majority of the code, which is necessary for control but is lightly used in operation, is written and remains in C. The remaining small percentage of the code performs the computationally intensive signal processing portions of the algorithm. This code, for the most part within loops, is being performed most of the time that the system is operating. For the program developer, it pays to rewrite and optimize these blocks of code in C intrinsics or linear/hand assembly languages, since a small gain in efficiency can have a great deal of leverage in improving the overall performance of the machine.2 Optimizing Performance It is necessary to set a clear, reasonable optimization goal that takes into account the underlying architectural details. Several important considerations for optimizing code include instruction latency, resource constraints, recurrence constraints, and nested loops. Let's look at these four issues in more detail starting with instruction latency. 1. Instruction Latency Unrolling loops not only reduces branch overhead but also allows more instructions to be grouped together for more efficient instruction scheduling. Software pipelining, which overlaps multiple iterations of a loop, can also largely offset the negative impact of the long latency on performance, typically in conjunction with proper loop unrolling. Figure 1 illustrates an example of modulo software pipelining where loop iterations are initiated at a constant rate known as iteration interval (II).
With traditional DSPs, programmers turned to the machine's assembly language for almost all code development. However, the VLIW architecture of high-performance DSPs makes assembly coding more difficult due to the increased programming complexity, such as multiple functional units with instructions that are statically scheduled and dispatched.
Click here for Code List 1
Click here for Code List 2
Click here for Code List 3
Click here for Code List 4
1The outer and inner loops are merged and unrolled four times for better performance.
2The ideal cycle count is the maximum attainable number of cycles for the loop only. The number of multiply instructions required is 1152, and it is a limiting factor in this code. Since the target DSP can perform 4 multiplies per cycle, the ideal cycle count is 288.
Different DSPs are designed with different characteristics, such as the number of parallel functional units, instructions supported by the units individually, support for instructions with unique data handling capabilities such as SIMDs, and instruction latencies for different instructions. Optimizing code, therefore, begins with choosing the right algorithm for the target DSP.
An important factor that affects code optimization and throughput in both the C and assembly languages is instruction latency. Although an instruction can be issued to the functional units every cycle, the number of cycles before the output is available is typically long due to the deeply pipelined hardware architecture. When the code does not take this latency into account, the delay can considerably offset the advantage of hardware pipelining. To reduce the negative impact of long instruction latency and also to maximally utilize multiple functional units in the VLIW architectures, software optimization techniques such as loop unrolling and software pipelining need to be considered.
In Figure 1, the A, B, C and D are groups of instructions required for a loop. In the middle stage, known as the kernel, all the instructions in the A, B, C and D groups are executed in parallel. The opening stage, known as the prologue, sets up the execution of the kernel, and the closing stage, known as the epilogue, drains the execution. The prologue and epilogue stages are less efficient than the kernel due to the smaller number of functional units utilized. When there are sufficient iterations of the loop, so that the kernel continues at length, the performance advantages of software pipelining are enormous. An optimal performance would be achieved with a minimum iteration interval while functional units are fully utilized in the kernel.
While the C compiler and linear assembler perform software pipelining automatically whenever applicable, it is a programmer's responsibility to alleviate resource and recurrence constraints so that the tools can minimize the iteration interval. One can further explore more optimal instruction scheduling by manually creating a software-pipelined loop as in Code List 3 from above.
2. Resource constraints
Programmers must be aware of factors that can make the code resource-bound within the constraints of the architecture. Then, they must work to minimize the effect of these factors.
To begin with, it is important to note that some instructions must be executed on a specific functional unit while others can be executed on more than one unit, as shown in Figure 2. Therefore, optimal performance depends on the number of instructions required for each functional unit as well as the total number of instructions.
Another example is that the algorithm may use too many adds or shifts but not many multiplies, reducing the overall utilization of functional units. The programmer can balance execution by changing some of the instructions to multiplies, for instance, by manipulating a counter with a multiply instruction instead of an add instruction.
Redundancy is also an issue that must be addressed when dealing with constrained resources. Some algorithms repeat certain steps, and the programmer can find ways to eliminate these redundancies. For example, during each iteration of an FIR filter, a coefficient array and an input array are multiplied and the results are all accumulated. For the first output, each element in the input array is multiplied by its corresponding element in the coefficient array. For the second output, the input array is shifted and each element in the coefficient array is multiplied by the succeeding element in the input array. While the loop can load the data and coefficients for each output, it can reuse the data and coefficients already loaded, eliminating redundant load instructions.
Another interesting example is that a very short inner loop may not have a sufficient number of operations to be software pipelined through several functional units, for example the use of 3 functional units in a loop on a VLIW machine with 8 functional units. In this case, it may be efficient to unroll the loop so that it has enough instructions to fully utilize multiple functional units.
Code partitioning is also an important performance factor. Most VLIW DSPs are cluster-based in order to reduce the hardware complexity. Communication between clusters is performed via a limited number of crosspaths, which can easily become a bottleneck if instructions are not well partitioned over clusters.
Due to the cluster-based architectures, it is often necessary to rewrite blocks of code in assembly in order to perform partitioning effectively. The more each cluster can be programmed for independent operation, the more these exchanges can be eliminated, thus removing a potential source of slowdowns. The advantage of effective partitioning becomes even more apparent as the code block becomes larger and more complicated.
3. Recurrence constraints
Code execution can become bound by recurrences as well as resources. When there is a dependency between loop iterations, the next iteration cannot start until the current iteration finishes.
Scheduling instructions can also become more difficult if there are recurrence dependencies. These dependencies can make the scheduled execution of instructions take longer, offsetting the advantages of parallelism.
While some dependencies are not optional, others may be introduced unnecessarily. For these, the programmer can use a qualifier in the code to alert the compiler to the presence of a false dependency, such as a restrict keyword in C to avoid false memory dependencies.
When one set of registers is used for all loop iterations, a register value has a certain life time. In modulo software pipelining, therefore, the iteration interval cannot be less than the longest register life time. The compiler typically increases the iteration interval to handle this situation, which adversely affects the performance.
A potential solution is to manually insert move instructions to break up the longest register life time. Unrolling the loop can also help with this situation. While unrolling cannot change the required register life time, it can increase the iteration interval without penalty.
4. Nested loops
While the optimization techniques discussed so far apply to code in innermost loops, there are additional considerations for nested loops. Since the outer loop is typically not pipelined, it is considered overhead. Therefore, it is important to note that the length of the innermost loop can affect the overall performance.
To reduce the overhead of the outer loop for a short inner loop, one may consider merging the outer loop into the inner loop and conditionally executing the outer loop code. One may also try to overlap the inner and outer loops as much as possible to minimize the overhead of the outer loop execution. For example, designers can perform the outer loop during the prologue and epilogue of the inner loop. They can also overlap the epilogue of an iteration with the prologue of a next iteration.
In some cases, it may make sense to unroll the outer loop as well as (or instead of) the inner loop in order to reduce the number of instructions or to balance partitioning among clusters.
Wrap Up
The programming considerations presented in this article affect both C and assembly programming today on most high-performance DSPs. While tools such as C compilers and linear assemblers are available, they cannot know everything about the code that is necessary in order to achieve the highest level of performance optimization.
In the future, these tools are likely to become more sophisticated. For the time being, the available tools give software developers a great deal of power in programming today's high-performance DSPs.
However, programmers must be aware of how to use the tools to optimize their code. C compilers, profilers, optimizers and other tools all serve to simplify code development in highly complex machines. But knowing certain coding techniques such as minimizing resource and recurrence constraints, as well as optimizing the execution of nested loops, can enhance performance. Tools may have helped automate many aspects of DSP code development, but achieving top performance is still a programmer's art.
References
- K. Karadayi et. al, "Strategies for mapping algorithm to mediaprocessors for high performance," IEEE Micro, vol. 23, no. 4, July-August 2003, pp. 58 - 70.
- J. Brenner and M. Levy, "Code efficiency & compiler-directed feedback," Dr. Dobb's Journal, December 2003, pp. 59 - 65.
About the Authors
Chris Chung is a software applications engineer at Texas Instruments. He received a Ph.D. in electrical engineering from the University of Washington and can be reached at cchung@ti.com.
Xiangdong Fu a software applications engineer in TI's DSP/Catalog Emerging End Equipment Department. He has a Master's degree in Electrical Engineering and can be reached at xdfu@ti.com.
Related Semiconductor IP
- Root of Trust (RoT)
- Fixed Point Doppler Channel IP core
- Multi-protocol wireless plaform integrating Bluetooth Dual Mode, IEEE 802.15.4 (for Thread, Zigbee and Matter)
- Polyphase Video Scaler
- Compact, low-power, 8bit ADC on GF 22nm FDX
Related White Papers
- Developing memory-efficient DSP apps without assembly code
- How to exploit 17 tried and true DSP power optimization techniques for wireless applications
- Developing DSP code on converged hybrid DSP/RISC cores
- DSP optimization strategies using simulators and profilers
Latest White Papers
- Reimagining AI Infrastructure: The Power of Converged Back-end Networks
- 40G UCIe IP Advantages for AI Applications
- Recent progress in spin-orbit torque magnetic random-access memory
- What is JESD204C? A quick glance at the standard
- Open-Source Design of Heterogeneous SoCs for AI Acceleration: the PULP Platform Experience