Embedded Systems: Programmable Logic -> FPGA tools need hardware assistance
FPGA tools need hardware assistance
By Scott Hauck, Associate Professor, Department of Electrical Engineering, University of Washington, Seattle, EE Times
February 16, 2001 (1:13 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010216S0028
Reconfigurable systems make it possible to create extremely high-performance implementations for many different types of applications. While techniques such as logic emulation provide a new tool specifically for logic designers, many other FPGA-based systems serve as high-performance replacements for standard computers for everything from embedded systems to supercomputers. The creators of these implementations are often software programmers, not hardware designers. However, if these systems hope to be usable by software programmers, they must be able to translate applications described in standard software programming languages into FPGA realizations. Thus, mapping tools that can synthesize hardware implementations from C, C++, Fortran or assembly language descriptions must be developed. Although there are ways to transform specifications written in hardware-description languages into electronic circuits, translating a standard s oftware program into hardware presents extra challenges. HDLs focus mainly on constructs and semantics that can be efficiently translated into hardware (though even these languages allow the creation of nonsynthesizable specifications). Software programming languages have no such restrictions. For example, hardware is inherently parallel and HDLs have an execution model that easily expresses concurrency. Most standard software languages normally have a sequential execution model, with instructions executing one after another. This means that a hardware implementation of a software program is either restricted to sequential operation, yielding an extremely inefficient circuit, or the mapping software must figure out how to make parallel an inherently sequential specification. Also, there are operations commonly found in software programs that are relatively expensive to implement in hardware. This includes multiplication and variable-length shifts, as well as floating-point operations. Although hardware can be synthesized to support these operations, software that makes extensive use of these operations will result in extremely large designs. Finally, software algorithms operate on standard-size data values, using 8-, 16-, 32- or 64-bit values even for operations that could easily fit into smaller bit-widths. By using wider than necessary operands, circuit data paths must be made wider, increasing the hardware costs. Thus, because we are using a language designed for specifying software programs to create hardware implementations, the translation software faces a mapping process more complex than for standard HDLs. But software programming languages contain constructs that are difficult to handle efficiently in FPGA logic. And because of this such translating techniques can restrict the language constructs that can be present in the code to be translated. Most do not allow multiplication, division or floating-point operations; some ban the use of structures, pointers or arrays, eliminate recursion or do not support function calls or control flow constructs such as case, do-while and loops without fixed iteration counts. Some techniques, which are intended primarily to compile only short code sequences, may restrict data structures to only bit vectors, or not support memory accesses at all. Other techniques extend C++ for use as an HDL. It is relatively simple to translate straight-line code from software languages into hardware. Expressions in the code have direct implementations in hardware that can compu te the correct result (they must, since a processor on which the language is intended to run must have hardware to execute it). Variables could be stored in registers, with the result from each instruction latched immediately and one instruction executing at a time. However, this loses the inherent concurrency of hardware. We can instead combine multiple instructions, latching the results only at the end of the overall computation. Renaming is a standard compiler technique for making such code parallel, a variant of which is contained in the Transmogrifier C compiler developed at the University of Toronto. The compiler moves sequentially through a straight-line code sequence, remembering the logic used to create the value of each assignment. Variables in an assignment that comes from outside this piece of code draw their values from registers for that variable. Variables that have been the target of an assignment in the code sequence are replaced with the output from the logic that computes its valu e earlier in the sequence. For complex control flow, such as loops that execute a variable number of times or "if" statements containing function calls, control-state machines must be implemented. The code is broken into blocks of straight-line code. Each of these code segments is represented by a separate state in the control state machine. The straight-line code is converted into logic functions with the series of operations collapsed into a combinational circuit. Variables whose scope spans code segments are held in registers. The input to the register is a mux that combines the logic for assignment statements from different code sequences. Once the controller for the hardware is constructed, techniques can be used to simplify this state machine. States can be combined sequentially or in parallel, allowing greater concurrency in the hardware as well as minimizing the hardware cost of the controller. However, an even more important matter is the simplification of the data path, something th at hasn't yet been dealt with adequately. In the construction given above every operation in the source code generates unique hardware. For simple computations this is fine. However, complex opera-tions such as multiplication and division will be scattered throughout the source code, implying that a huge amount of hardware will be needed to implement all of these computations. But each multiplier will be used only within the state corresponding to that portion of the source code; otherwise, it will sit idle. The circuit's size could be greatly reduced if those hardware multipliers were reused in different places in the code. A single hardware multiplier can be used for many separate multiplication operations from the source code, as long as each occurs in different states. There are several different systems that convert code sequences in software programming languages into hardware realizations. While they all support the translation of control flow and data path operations into circuits, t hey differ on the amount of code they convert and the model of operation they assume. Perhaps the most straightforward is the Transmogrifier C system. It takes a complete program written in C and translates it into a circuit that can be implemented directly in the configurable logic blocks of Xilinx FPGAs. Special pragma (compiler directive) statements are included to declare external inputs and outputs, including the assignment of those communications to specific FPGA pins. This yields a system where the resulting hardware implementation is expected to be a complete self-contained system implementing the entire functionality of the desired behavior. Mapping all those instructions into FPGA logic means that the complete functionality of the entire program must be available simultaneously, using a huge amount of circuit space. However, we can implement only the most frequently used portions of code inside the FPGA, achieving a significant performance boost with only a small amount of hardware. During the execution of the program, the processor executes the software code until it hits a portion of the code im plemented inside the FPGA coprocessor. The processor then transfers the inputs to the function to the FPGA coprocessor and tells it to begin computing the correct subroutine. Once the FPGA has computed the function, the results are transferred back to the processor, which continues with the rest of the software code. An added benefit of the coprocessor model is that the software-to-hardware compiler does not have to support all operations from the software programming language, since complex functions such as multiplication or memory loads can be handled instead by the host processor. Although compiling only the critical regions of a software algorithm can reduce the area requirements and avoid hardware-inefficient operations, it does introduce problems unique to those types of systems. One problem is that some mechanism must be introduced for communicating operands and results between the processor and the coprocessor. For systems like Harvard's Prisc, which view the FPGA as merely a mechanism for increasing the instruction set of the host processor, instructions are restricted to reading from two source registers and writing one result register, just like any other instruction on the processor. However, other systems have much less tightly coupled FPGAs and require protocols between the two systems. In most of these systems the communication mechanism puts a hard limit on the amount of information that can be communicated and thus the amount of the computation that can be migrated to hardware. For example, if only two input words and a single output word are allowed, there is obviously only so much useful computation that can be performed in most circumstances. A second important concern with compiling only a portion of a program into hardware is determining which portions of the code to so map. Obviously, the code that gets mapped to the hardware needs to contain a large portion of the run-time of the overall algorithm if the designer hopes to achieve significant performance improvements. However, it is difficult to determine strictly from the source code where the critical portions of a program are. In general, the solutions are to profile the execution of the software on a sample input set, to find the most frequently executed code sequences, or to have the user pick the code sequences to use by hand. But simply identifying the most often executed code sequences may not yield the bes t speedups. Specifically, some code sequences will achieve higher performance improvements than others when mapped to hardware. Also, some code sequences may be too complex to map into hardware, not fitting within the logic resources present in the FPGA. In general, greater automation of this decision process is necessary.
Many research projects have developed methods for translating code in C , C++ , Ada, Occam, data parallel C, Smalltalk, assembly and Java, as well as special HDLs into FPGA realizations. These systems typically take software programs written in a subset of the programming language, translate the da-ta computations into hardware operations and insert multiplexing, latches and control state machines to recreate the control flow.
Most of the systems for translating software programs into hardware algorithms assume that only the most time-critical portions of the code are mapped. Those systems use the FPGA, or FPGAs, as a coprocessor to a standard CPU. The processor implements most of the program, handling much of the operations that are neces sary to implement complex algorithms, but which contribute little to the total computation time. The truly time-critical portions of the algorithm are translated into hardware, using the FPGA to implement the small fraction of the total code complexity that accounts for most of the overall run-time. In that way the strengths of both FPGAs and standard processors are combined into a single system. Processors can easily implement a large variety of operations by working through a complex series of instructions stored in high-density memory chips.
This does limit the portions of the code that can be translated into hardware. However, a system that converts the complete program into hardware must either convert those operations into FPGA realizations, yielding much larger area requirements, or ban those constructs from the source code, limiting the types of operations and algorithms that can be supported. Systems such as the Nimble compiler developed at Synopsys provide a middle ground, making effective use of both FPGA and CPU components for different portions of a given computation.
Related Semiconductor IP
- Root of Trust (RoT)
- Fixed Point Doppler Channel IP core
- Multi-protocol wireless plaform integrating Bluetooth Dual Mode, IEEE 802.15.4 (for Thread, Zigbee and Matter)
- Polyphase Video Scaler
- Compact, low-power, 8bit ADC on GF 22nm FDX
Related White Papers
- Sorting Out Embedded Vision Systems
- Secure Embedded Systems: Digging for the Roots of Trust
- Implementing Ultra Low Latency Data Center Services with Programmable Logic
- Hardware Acceleration for Embedded Computing
Latest White Papers
- Reimagining AI Infrastructure: The Power of Converged Back-end Networks
- 40G UCIe IP Advantages for AI Applications
- Recent progress in spin-orbit torque magnetic random-access memory
- What is JESD204C? A quick glance at the standard
- Open-Source Design of Heterogeneous SoCs for AI Acceleration: the PULP Platform Experience