Application engine synthesis offers new design approach
Vinod Kathail
(10/29/2004 3:00 PM EDT)
What is an application engine and why is it necessary? Broadly defined, it is circuitry dedicated purely to the execution of a given algorithm or set of algorithms on a system-on-chip (SoC) device.
It is necessary because execution of these algorithms by the SoC's main or system processor often cannot be achieved within the application's performance, power and silicon cost constraints. Application engines are deployed even when the main processor has the capacity to execute the selected algorithms with the desired performance, because the processor's power consumption is often unacceptable. Indeed, according to a study by Bob Brodersen of the University of California, Berkeley, dedicated hardware consumes one to two orders of magnitude less power than programmable approaches.
Such dedicated circuitry is not new. Nearly every SoC has manually designed RTL blocks that "offload" the processor. But what is new is the exponential growth in both the algorithmic complexity of these engines and the sheer volume of data that they must process. These exponential increases are driven by the integration into the SoC of multiple complex functions such as video processing, imaging, audio, wireless and wireline communications.
Application engines have become critical functionality enablers in SoCs for complex consumer devices, implementing industry-standard algorithms such as H.264, MPEG-2, MPEG-4 and MP3, and proprietary algorithms such as imaging pipelines in digital cameras and printers. Consequently, such application engines are critical to the success of an SoC platform-based design methodology, which enables rapid design of the SoC derivatives necessary to service the consumer market (see figure 1).
Unfortunately, application engine development using the established manual RTL design methodology is time-consuming, and often requires multiple re-spins to achieve the requisite results. Clearly, an automated approach is required.
Figure 1 — Platform-based SoC with integrated application engines
Application engine design
So, what are the technical requirements of an automated application engine design technology? We must start by understanding the design challenge that application engine designers face. This challenge falls into two distinct phases.
Firstly, from the reference algorithm — which is a functional description written in a high-level language such as C, and independent of implementation — an algorithm must be developed in untimed C that is suitable for efficient implementation in hardware and/or software. This is a highly creative, non-mechanical and, unavoidably, manual task that produces what is essentially a critical piece of a company's differentiating intellectual property (IP). Such development can consume several engineer-months of effort.
Examples of the individual tasks to be undertaken include:
- Conversion of the algorithm description from floating point to fixed point arithmetic.
- Determination of the bit widths, which must be great enough to ensure output quality in, say, picture or audio, but small enough to deliver a cost-effective implementation.
- Determination of the appropriate memory architecture. The reference algorithm may use a single memory, while the implementation algorithm may deploy multiple local memories to increase performance.
Secondly, a strategy must be devised to realize the implementation algorithm with appropriate programmable and/or non-programmable circuitry — standard and/or custom. Multiple strategies may be necessary because an algorithm may be integrated into multiple products, each with its own performance/power/cost requirements.
For instance, a digital camera with some movie capability would use a high performance imaging pipeline for still photos, and a low performance MPEG encoder for the video. Conversely, a camcorder with some still photo capability would use a high performance MPEG encoder for the video, and a low-performance imaging pipeline for the stills. Same algorithms, different performance requirements and different implementations.
The established implementation approaches are to incorporate dedicated hardware blocks, to deploy additional processors or DSPs customized for the application, or to combine the two.
The dedicated hardware block design challenge varies enormously. For example, an encryption engine is a relatively simple block. By contrast, an imaging or video pipeline consists of multiple dedicated blocks that implement an algorithm containing multiple loops, the design of which involves far more than simply the design and assembly of individual blocks. For instance, the designer must ensure that blocks are rate-matched to minimize the size of inter-block buffers, and that they execute concurrently on different tasks to meet the requisite performance (see figure 2).
Figure 2 — Camera image pipeline: first and last blocks run 2x slower than the middle blocks to rate-match the pipeline.
By even greater contrast, a complex application engine such as an MPEG encoder contains a combination of multiple hardware accelerators, a processor and multiple memories. The accelerators execute the compute-intensive tasks, such as the discrete cosine transform (DCT) and motion estimation, which constitute about 15% of the code but which consume about 75% of the execution time.
The processor executes the remaining tasks, which may be single and/or multi-loop. Thus, the designer must use a standard processor or design a custom processor, and manually design the hardware accelerators. Manual design of such an application engine can consume several engineer-years of effort.
The manually designed dedicated hardware accelerator offers superior performance, but at a design productivity cost that can threaten time to market. The processor offers the flexibility of programmability, but can be power hungry, and might not meet high performance requirements at reasonable silicon cost.
Thus, the first requirement of an automated technology is that it must deliver the benefits of both approaches, and with a very high degree of automation. This is what Application Engine Synthesis (AES) technology does — and it does so in a matter of days or weeks.
How application engine synthesis works
AES technology uses configurable IP combined with configuration and exploration software tools to automatically design an application engine from the algorithm written in untimed C, together with its test data.
First, the AES technology analyzes the C description and identifies the parallelism in the (sequential) C code that enables the application engine to achieve its performance objectives. The technology identifies parallelism at all levels:
- Inter-task, where a task can be commenced before a prior task has been completed.
- Intra-task, where a block can commence a task before a prior block has completed it.
- Inter-iteration, where multiple iterations of a single loop execute in parallel.
- Intra-iteration, at the instruction level, where multiple instructions execute in parallel on multiple functional units.
The AES technology then identifies and partitions the code to be executed on dedicated hardware accelerators and a programmable processor (see figure 3). The technology then allocates the code to two configurable architecture templates.
It allocates compute-intensive code to a pipeline processor array (PPA) template that builds dedicated hardware accelerators with the appropriate rate matching, and control code to a template that builds a very long instruction word (VLIW) custom processor. Such templates enable the re-use of pre-verified functional units, while skillful design of these templates ensures right-first-time timing closure and place-and-route.
Figure 3 — AES allocates code to multiple hardware accelerators and a VLIW processor
The user can explore the design space with multiple "what if" scenarios to obtain a range of implementation alternatives, from which may be selected the configuration that provides the optimal power/speed/gate count trade-off. The AES technology automatically generates synthesizable RTL, logic synthesis script, testbench, and software drivers to accelerate verification and integration into the SoC design, together with a SystemC interface that facilitates system-level simulation and validation.
Application engine verification is achieved through the use of the testbench and test vectors generated automatically from the C test inputs. The designer may also use random and user-defined stimuli to perform perturbation testing of corner-case conditions, such as buffer underflow/overflow and stall conditions. The AES technology can also automatically generate bit-accurate and cycle-accurate models, which — together with the SystemC interface — facilitate transaction-level simulation of the entire system.
Table 1 shows results for dedicated hardware accelerators on five application engines:
Table 1 — Comparative design metrics of AES and manual RTL design
It can be seen that complex algorithms can be implemented with, in the worst case, the same performance and area as manually designed RTL, or with higher performance and less area. In all cases, the design time is orders of magnitude less than that of manual RTL design. The technology identifies the optimal configuration to meet the design objectives, and determines in advance whether or not the engine can be designed with the requisite performance/area/cost.
What are the alternatives to Application Engine Synthesis? There is no current alternative technology that synthesizes application engines consisting of an optimal combination of processor (if required) and hardware accelerators, from C algorithm descriptions. The alternatives generate either pure processors or pure accelerators.
For pure hardware accelerator design, second-generation behavioral synthesis technologies automatically create from C descriptions the RTL micro-architecture, such as datapaths and one or more state machines. These tools give the designer a degree of control that was lacking in first-generation tools, but the designer must still undertake the time-consuming design of the multiple implementation candidates necessary to identify the optimum configuration.
The methodology seems to be adept at creating small datapath processing blocks for subsets of an application engine, such as DCT. However, it is not suitable for synthesizing a whole application engine consisting of multi-loop pipelines, potentially communicating with a processor, as required to implement an MPEG-2 algorithm.
For pure programmable hardware design, configurable and extensible processors suggest themselves as a possible — and very flexible — design approach. Using such processors, the designer or the compiler determines which instructions are to be implemented in the custom processor.
A single processor is adequate for an MP3 algorithm, but multiple processors are necessary for high performance video processing. The latter thus requires the designer to partition the algorithm over several processors, and link the processors together to create a solution, at the cost of power consumption and chip area.
How much power consumption and area? Analysis of one video filtering algorithm shows that such an implementation would consume more than an order of magnitude more power and nearly an order of magnitude more chip area than a dedicated accelerator delivering the same performance.
How to evaluate implementation approaches
How does the designer determine the optimum implementation approach and select the appropriate tools to support it? From the various examples given, it is clear that the range of possible applications, each with its unique trade-offs, renders "standard benchmarks" unusable.
A much more robust foundation for these complex decisions is an application-targeted evaluation process that measures results on a "realistic" benchmark circuit to provide the requisite data. How should an engineering team approach this evaluation process?
The process consists of four distinct steps: technology benchmark, technology review, supplier evaluation, and initial deployment.
The objective of the technology benchmark is simply to ensure that the technology can be used to design the application engine to specification requirements. At this stage the designer is not interested in how the technology works.
Typical benchmark criteria may include:
- Confirm that the input language is truly algorithmic, and that the code is sufficiently portable to facilitate design space exploration.
- Determine the technology's ability to synthesize, for example, multi-loop designs with conditionals, and to implement them with multiple throughputs and clock frequencies.
- Measure overall verification time.
- Measure RTL performance and area in comparison with both manual methods and specification requirements, and determine how easily the RTL can be integrated with the rest of the SoC.
- Evaluate ease of technology integration into the team's established design flow.
If the technology passes the benchmark test, a technology review is undertaken to determine the reproducibility of results, the technology's limitations and its future potential. It is necessary to obtain details from a supplier's development engineer or engineering management about how the technology works, where it works best and where it doesn't, and gain a thorough understanding of the roadmap. This builds the credibility needed by the team to move to the next step — supplier evaluation.
Supplier evaluation also builds credibility as the team evaluates the technology supplier's commitment to customer success. Does the company have the technical depth to deliver on the roadmap? Is there a customer support culture (not just an 800 number!) to help when there is a problem?
Upon satisfactory completion of the supplier evaluation, the team moves to initial deployment, the objectives of which are to introduce the technology into the design flow with minimal disruption, and to achieve a fast return on investment to prove the capability.
The key here is to make the supplier work for his money. Challenge the supplier to help you complete a realistic design example within a specified time. Spend another two weeks to gain confidence in the technology, even if you are coming under end-of-quarter pressure from the supplier, and then make the go/no-go decision.
In summary, Application Engine Synthesis delivers the increased productivity and the reduced time to market and reduced risk necessary to ensure that application engines can be designed with the rapidity and confidence required for producing complex consumer devices.
Vinod Kathail, Synfora founder, CTO, and vice-president of engineering, founded the company after a long career at Hewlett-Packard Laboratories. There, he was most recently the R&D Program Manager and a principal scientist in the Compiler and Architecture Research (CAR) Program. He was also the Chief Architect of the PICO project at HP.