Tips for efficient embedded code generation
Tips for efficient embedded code generation Compiler optimizations allow for the development of comparatively less expensive products with more features. An understanding of the practical compiler optimizations allows the developer to write the most efficient code, and to know where to focus his time in improving the his product’s performance. Following are a number of coding examples and tips for efficient code generation on compilers. Choice of Data Type Sizes: In practice, using a smaller type for a local variable is computationally inefficient. Consider the following snippet of code: int m; In PowerPC assembly code, this will look like: addi r4, r4, 1 ; add 1 to c On a CISC chip like the 80386, it will look like: add $1, %al ; add 1 to c On a RISC chip, one pays the price after writing to a sub-integral variable – the value must be sign or zero extended in register to make sure that the high bits of the register are consistently set. On a CISC chip, one pays the price when mixing the short variable with a larger integer variable. And unless you’re using an 8 or 16-bit microprocessor, the microprocessor doesn’t save on the shorter add instruction, so there is nothing to make up for the cost of the sign or zero extension after the increment. Variables in memory are sometimes different. Since memory is scarce, it often makes sense to conserve by using a smaller data type. Most microprocessors can load and extend the short value in a single instruction, which means that the extension is free. Also, the truncation that is necessary when storing the value can often be avoided since sub-integral store instructions generally ignore the high bits. Larger data types should also be used sparingly. For example, C99 adds the long long type, and this type has been available as an extension to C89 in most compilers for years. On a 32-bit microprocessor, these variables take up two registers and require multiple operations to perform arithmetic on them. An add or a subtract can usually be done inline in a few instructions, but an innocuous looking shift can turn into quite a few instructions. And you can forget about division! Of course, these variables may be necessary or convenient when coding, but don’t use them unless you need them and understand the corresponding code size and speed impact. Variable signedness: Characters deserve special mention here. Many architectures, such as the ARM, and PowerPC, are more efficient at loading unsigned characters than they are at loading signed characters. Other architectures, such as the V800 and SH, handle signed characters more efficiently. Most architectures’ ABIs define the plain “char” type as whichever type is most efficient. So, unless a character needs to be signed or unsigned, use the plain char type. Use of access types: This code can be optimized by the compiler in several ways because the functions were declared static. First, the compiler can inline the call to allocate_new_unique_number() and delete the out-of-line copy, since it is not called from any other place. Second, if the compiler has basic information on the external allocate_other_fields() function, it can sometimes tell that the function will not call back into this module. This allows eliminating the second load of unique_number embedded in the in-lined allocate_new_unique_number() function. On the other hand, “static” data should be avoided whenever possible for function-local data. The data's value needs to be preserved between calls to the function, making it very expensive to access the data, and requiring permanent RAM storage. The static keyword should only be used on function-local data when the value must be preserved across calls or for large data allocations (such as an array) in which the programmer prefers to tradeoff consumption of stack space for permanent RAM. Global Variable Definition For example, if the global variable “glob” is used the most in file.c, it should be defined there. Use a definition such as: int glob; or: int glob = 1; in addition to declaring it (with extern int glob;) in a header file so that other modules can reference it. Some programmers get confused about uninitialized variable definitions, so it is worth clarifying how they are often implemented. Most C compilers support a feature, called common variables, where uninitialized global variable definitions are combined and resolved at link-time. For example, a user could place the definition, as follows: int glob; in a header file and include this header file in every file in the project. While a strict reading of the C language implies that this will result in link-time errors, under a common variable model, each module will output a special common reference, which is different from a traditional definition. The linker will combine all of the common references for a variable, and if the variable was not already defined elsewhere, allocate space for the variable in an uninitialized data section (such as .bss) and point all of the common references to this newly allocated space. On the other hand, if the user employs a definition such as: int glob = 1; in one module, all other uninitialized, common references would resolve to this definition. It is best to write code without defining the same uninitialized global variable in multiple modules. Then, if you turn off the common variable feature in the compiler, the compiler is able to perform more aggressive optimizations since it knows that uninitialized global variable definitions are true definitions. Volatile Variables 1) The variable is a memory mapped hardware device where the value can change asynchronously and/or where it is critical that all writes to the variable are respected. 2) The variable is used in a multi-threading environment where another thread may modify the variable at any time. The alternative to careful use of the volatile keyword is to disable so-called "memory optimizations". Effectively, all variables will be treated as volatile when this option is chosen. Since this disables many optimizations, developers are encouraged to choose a less conservative approach. Separate threads often perform separate functions. As a result, they may have many variables and data structures that are not accessed by other tasks. Good software engineering practices may be able minimize the overlap to a few shared header files or. If such practices have reduced the scope of the code to a few files, it is then more feasible to find the variables and data structures that are shared between threads. Const Restrict: Also, consider the inner loop for a PowerPC target: Better code could be generated by pulling the first load out of the loop, since in[i] does not change within the inner loop. However, the compiler can't tell that in[i] doesn't change within the loop. The compiler is unable to determine this because the in and out arrays could overlap. The restrict keyword helps in cases like these. If the function declaration is changed to: void fir_and_copy(int *in, const int *coeff, int *restrict out) the compiler now knows that writes through *out cannot change the values in * in. The restrict keyword is a bit confusing since it applies to the pointer itself rather than the data the pointer points to (contrast const int *x and >i> int *restrict x). This is because some interesting conceptual problems developed with the first attempt to add this feature to the C language - the noalias keyword that was considered for the C 1989 specification. Dennis Ritchie, one of the progenitors of the C language, pointed out some of these problems, and noalias died in committee. It's reincarnation in the C99 restrict keyword, while conceived independently of Ritchie, has received his blessings. Pointers and the & operator Taking the address of a variable forces the compiler to allocate the variable on the stack, all but assuring less efficient code generation. Passing an argument as a C++ reference parameter has the same effect. Declaration scope of variables The compiler could potentially perform a tail call for the call to func2(). This is an optimization where the frame for func(), if any, is popped, and the last call instruction to func2() is replaced by a branch to func2(). This saves the need to return to func1(), which would immediately return to its caller. Instead, func2() returns directly to func1()’s caller. Unfortunately, the compiler cannot employ this optimization since it cannot determine that loc will not be used in the second call to func2() (which is possible if its address was saved in the first call). It would be better to write the code as follows: In this case, the compiler knows that the lifetime of loc ends before the final call – and the tail call, at least in principle, can happen. Another benefit of using inner scopes is that variables from non-overlapping scopes can share the same space on the stack. This helps to minimize stack use and can result in smaller code size on some architectures. However, it is usually not worthwhile to create artificially small scopes simply to bound the lifetimes of variables. Floating Point Arithmetic Unless doubles are implemented in hardware, it will be more efficient to do arithmetic with the single precision type. If this is the case and if single precision arithmetic is sufficient, follow these rules: 1) Write single precision floating point constants with the F suffix. For example, write 3.0F instead of 3.0. The constant 3.0 is a double precision value, which forces the surrounding arithmetic expressions to be promoted to double precision. float foo(f) instead do: float foo(float f); This is probably only a concern for old code bases. Variable Length Arrays Low Level Assembly Instructions Inlined assembly code uses non-portable syntax and compilers generally make over-conservative assumptions when encountering inlined assembly, thus affecting code performance. Intrinsics can be #define’d into other names if switching from one compiler to another. This can be done once in a header file rather than going through the code to see every place where inlined assembly was used. For example, instead of writing the following code to disable interrupts on the PowerPC: asm(“mfmsr r3”); One can write: __SETSR(__GETSR() & ~(1 << (31-16))); Manual Loop Tricks is sometimes manually transformed into: Such transformations are usually only effective under fairly specific architectures and compilers. For example, the 32-bit ARM architecture supports the post-increment addressing mode used above, but the PowerPC architecture only includes the pre-increment addressing mode. So, for the PowerPC, this loop could better be written as: As a general rule, only do manual transformations for time critical sections of your code where your compiler of choice has not been able to perform adequate optimizations on its own, even after adjusting compilation options. Write simple code for most cases and let the compiler do the optimization work for you. Greg Davis is Senior Engineer / Engineering Manager, Compiler Development at Green Hills Software, Inc.. This article is based on material developed for Class 411 on compiler optimization presented at the Fall 2005 Embedded Systems Conference.
By Greg Davis, Embedded.com
Sep 18 2005 (21:00 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=170704298
The C language’s most fundamental data type is the integer. It is normally defined as the most efficient type for a given processor. The C language treats this type specially in that the language does not operate on smaller types, at least conceptually speaking. If a character is incremented, the language specifies that the character is first promoted to an integer, the addition is performed as an integer, and then the truncated result is stored back into the character.
char c;
m += ++c;
clrlwi r4, r4, 24 ; truncate result to 8 bits
add r3, r3, r4 ; add the new c value to m
movbsx %al, %ebx ; sign extend to an integer
add %ebx, %edx ; add the sign extended value to m
Unsigned variables can result in more efficient generated code for certain operations than using signed variables. For example, unsigned division by a power of two can be performed as a right shift. Most architectures require a few instructions to perform a signed divide by a power of two. Likewise, computing an unsigned modulus by a power of two can be implemented as an AND operation (or a bit extract). Computing a signed modulus is more complicated. In addition, it is sometimes useful that unsigned variables can represent just over twice as many positive values as their signed counterparts.
For global data, use the static keyword (or C++ anonymous namespaces) whenever possible. In some cases, this allows a compiler to deduce things about the access patterns to a variable. This also "hides" the data, which is generally a good thing from a programming practices standpoint. Declaring a function to be static is also helpful in many cases as in the following example:
Compilers can sometimes optimize accesses to global variables if they are defined and used together in the same module. In that case, the compiler can access one variable as an offset from another variable’s address, as if they were both members of the same structure. So, all other things being equal, it is worthwhile to define global variables in the modules where they are used the most.
The volatile keyword tells the compiler not to optimize away accesses to the variable or type. That is, the value will be loaded each time it is needed, and will be stored each time it is modified. Volatile variables have two major uses:
The const keyword is helpful in a couple of ways. First, const variables can usually be allocated to a read-only data section, which can save on the amount of RAM required if the read-only data is allocated to flash or ROM. Second, the const keyword, when applied to a pointer or reference parameter, allows the compiler to recognize certain cases when a variable is guaranteed not to be modified through the pointer by a function call.
The restrict keyword tells a compiler that the specified pointer that is the only way to access a given piece of data. This can allow the compiler to optimize more aggressively. Consider the following example, a simple finite impulse response code:
It is usually more efficient to have a function return a scalar value than to pass the scalar value by reference or by address. For example, instead of passing the address of an integer for a function to write to, as in this code:
Declare a variable in the inner-most scope possible, particularly when its address needs to be taken. In such cases, the compiler must keep the variable around on the stack until it goes out of scope. This can inhibit certain optimizations that depend on the variable being dead. For example, consider the variable “loc” in the following function:
Understanding the rules of arithmetic promotion can help you avoid costly mistakes. Many embedded architectures do not implement floating point arithmetic in hardware. Some processors implement the single precision “float” type in hardware, but leave the “double” type to floating point software emulation routines.
2) Use single precision math routines, such as sinf() instead of sin(), the double precision version.
3) Avoid old-style prototypes and function declarations since these force floats to be promoted to double. So, rather than:
float f;
The variable length array feature, which is included in C99, may result in less efficient array access. In cases where the array is multi-dimensional and subscripts other than the first are of variable lengths, the resulting code may be larger and slower. The feature is useful, but be aware that code generation may suffer.
Sometimes, it is helpful or necessary to use specific assembly instructions in embedded programming. Intrinsic functions are the best way to do this. Intrinsic functions look like function calls, but they are inlined into specific assembly instructions. Check with your compiler vendor’s documentation to determine which intrinsics are available.
asm(“rlwinm r3, r3, 0, 17, 15”);
asm(“mtmsr r3”);
Sometimes programmers feel compelled to manually unroll loops, perform strength reduction, or use other transformations that a standard compiler optimizer would handle. For example:
Related Semiconductor IP
- Root of Trust (RoT)
- Fixed Point Doppler Channel IP core
- Multi-protocol wireless plaform integrating Bluetooth Dual Mode, IEEE 802.15.4 (for Thread, Zigbee and Matter)
- Polyphase Video Scaler
- Compact, low-power, 8bit ADC on GF 22nm FDX
Related White Papers
- Paving the way for the next generation audio codec for the True Wireless Stereo (TWS) applications - PART 1 : TWS challenges explained
- Paving the way for the next generation audio codec for True Wireless Stereo (TWS) applications - PART 2 : Increasing play time
- Paving the way for the next generation audio codec for True Wireless Stereo (TWS) applications - PART 3 : Optimizing latency key factor
- Paving the way for the next generation audio codec for TRUE Wireless Stereo (TWS) applications - PART 4 : Achieving the ultimate audio experience
Latest White Papers
- Monolithic 3D FPGAs Utilizing Back-End-of-Line Configuration Memories
- Reimagining AI Infrastructure: The Power of Converged Back-end Networks
- 40G UCIe IP Advantages for AI Applications
- Recent progress in spin-orbit torque magnetic random-access memory
- What is JESD204C? A quick glance at the standard