Novel Techniques for Very High Speed Instruction-Set Simulation
Bangalore India
Abstract :
Instruction-set simulation is a well established method for a variety of uses: as tool for architecture exploration of next-generation architectures, as reference model for design verification, and as platform for software development. Emphasis on the utility of simulators in pre-silicon software development has continued to grow and we have witnessed the birth and adoption of virtual platforms that facilitate complete application software development on a desktop PC. This, coupled with the increasing complexity of SoC architectures, poses a significant challenge: can we simulate architectures at high enough speeds that allow full application development, debugging and analysis on a simulator to become a feasible and fruitful exercise?
In this work, we describe a set of novel & patented techniques to achieve very high speed instruction-set simulation. Techniques include execution-based dynamic code generation, granularity of simulation at basic blocks, efficient handling of delayed register updates, and efficient support for accurate interrupt recognition.
These techniques have been prototyped on the TI DSP TMS320 C64x CPU [9] architecture simulator and have demonstrated average speeds of 100 MIPS and top speeds in excess of 200 MIPS on standard off-the-shelf PCs. Further, they seamlessly work with existing interpretive simulators thereby facilitating significant reuse & integration into existing platforms & debuggers. This paper describes some of the key challenges to high speed simulation, a suite of techniques for addressing these, their novelty, and impact. Case study of the TI DSP TMS320C64xx CPU simulator and benchmark results are also presented to illustrate the impact of these techniques.
i. INTRODUCTION
Instruction-set simulation is being increasingly applied to address architecture exploration, design verification and software development. With the advent of virtual platforms, complete software development has become feasible on a desktop PC prior to silicon availability. Such virtual platforms have demonstrated ability to boot up operating systems, play audio streams in real-time and perform fast encoding & decoding of video streams. Going forward, it is anticipated that almost all the software (firmware, chip support packages, DSP algorithm libraries, operating systems, device drivers, application frameworks and end applications) would be developed, debugged and optimized on such virtual platforms.
Increasing demand for higher throughput from embedded electronic devices has driven increased integration of DSP processors, controllers, accelerators, custom logic and peripherals onto system-on-chip (SoC) architectures. Multi-core chips are fairly common in most embedded devices, including cell phones, digital cameras, set top boxes etc.
Simulating such multi-core SoCs at high enough speeds to enable realistic application workload execution is a huge challenge. Traditional interpretive simulators running at the order of a few millions of instructions per second (MIPS) no longer suffice. We need simulation platforms that can deliver higher orders of simulation speeds to enable use cases such as executing full regression suites, performing iterations of developdebug- optimize cycles, executing multi-layered end applications and so on.
In this paper, we present a set of novel & patented techniques to achieve very high speed instruction-set simulation. These techniques address multiple challenging requirements – achieving high speed on all classes of input applications, ability to handle self modifying code & overlays, full featured integration into debuggers & cycle accurate device simulators, retargetability to different hosts (eg: Windows-PC, Linux- PC, Solaris-Sparc), re-targetability of techniques to different ISAs, and low simulator development effort, & maintenance effort.
Our techniques, demonstrated on the TI DSP TMS320 C64x CPU [9] ISA, have delivered average speeds of 100 MIPS on audio and video codec benchmarks.
The reset of the paper is organized as follows: Section ii provides a background on existing high speed simulation techniques and the underlying technology that we use. Section iii provides a detailed description of the challenges involved in achieving high simulation speeds. Section iv describes the proposed simulation architecture, including a description of the novel execution-based code generation technique. Section v describes additional techniques that helped speed up the simulation and meet the requirements identified above. Section vi reports the TI DSP TMS320 C64x DSP ISA simulator speed against a set of benchmarks. We conclude in section vii.
ii. RELATED WORK
Techniques for high speed instruction-set simulation have been studied in the past and several publications describe prevalent techniques and their value. Broadly, all these techniques fall under the category of ‘compiled simulation’ [1]: the target program to be simulated is ‘compiled’ to an equivalent program that can be run on the simulation host (in our case, the desktop PC). Techniques of compilation vary greatly and may be classified into two categories: static translation [2] and dynamic translation [3]. Static translation involves onetime translation of the target program to host binary. Being static in nature, this method can not exploit runtime optimizations and does not integrate into scenarios that require code modification (eg: self modifying code, overlays). Dynamic translation performs run-time translation of the input target program to host executable code. By virtue of run-time translations, this method generally supports run-time code modifications and can leverage run-time information to speed up the simulation. Further, dynamic translation is performed in units of target basic blocks or larger chunks to allow optimizations at larger scope [4]. Sophisticated translation systems improve performance via dynamic block learning and translation of only the frequently encountered blocks [5], [6].
Dynamic translation may be further classified into two categories depending on the method of translation: direct host binary code generation and host source code compilation. Direct host binary code generation [4] involves transformation of (parts of) the input program directly to host executable code. Typically, such systems use in-built code generation techniques to emit efficient host code, avoiding use of an external host compiler. This method is commonly used because of its flexibility in controlling the nature of generated code. However, it suffers two significant drawbacks: portability to new host architectures (eg: moving PC simulations to Sparc workstations), and the need for sophisticated code generation technology within the simulation kernel.
Dynamic translation via host source code compilation requires use of an external host compiler that can translate an intermediate representation of the target binary to host executable. In such systems, the simulation kernel translates (parts of) the target binary to an intermediate representation (eg: ‘C’ code) that can be separately compiled to produce host executable code. This method provides sufficient flexibility to the simulation environment while avoiding the drawbacks of direct code generation mentioned earlier. However, the use of external compilation has associated run-time overheads and must be alleviated suitably. Typically, this is handled by keeping track of the most frequently executed basic blocks (or larger chunks) and invoking the compiler only when compilation threshold criteria are satisfied. This is the method that we use.
The problem of generation of source code from a specification of the functionality of the target instructions has been studied in [7] and [8], but with the use of specific templates to capture the instruction functionality. While the problem of generating efficient host binary directly (without use of host compiler) has been investigated, considerations for efficient dynamic source code generation and sophisticated simulation techniques for this variant of hybrid simulation have possibly never been well explored or documented. Our work explores this avenue and presents original results.
iii. CHALLENGES IN ACHIEVING HIGH SPEED SIMULATION
In this section, we motivate our techniques by describing some of the key challenges that must be addressed. First, we lay the ground by describing the structure of a typical interpretive simulator.
1. Overheads inherent in the interpretive simulator:
The typical interpretive simulator employs a dispatch-execute-update cycle for each instruction to be simulated. Pseudo code for this behavior is depicted below:
{
- Look up decode cache for instruction at program counter
- Call semantic function associated with instruction è register writes are posted to a delay queue
- Update program counter
- Perform updates from delay queue to register state variables to complete the semantics of dispatched instructions
Several limitations and overheads inherent with the above implementation may be noted:
- Execution of just one instruction per iteration of the loop (or, in multi-issue CPU’s, one execution packet per iteration). Large overheads are encountered in steps a., b, and d. above.
- Semantic function call is a function pointer invocation (step c.)
- Costly register write-back implementation (steps c. and e.)
- Input program characteristics are completely ignored.
2. Need for retargetability of the framework Any solution that overcomes the challenge outlined above must be retargetable along two directions:
- portability to other hosts (eg: from Windows PC to Solaris Sparc)
- support for multiple different target ISAs
3. Efficient support for delayed register writes For architectures with an exposed pipeline, as in the TI DSP TMS320C6000 [9], the behavior of delayed register writes must be obeyed in the simulator to ensure functional correctness. For example, in the code snippet below, the ‘load’ instruction updates the register ‘A1’ in the fifth cycle and the following ‘add’ instruction uses the old value of the register. This is an artifact of the exposed pipeline and is the desired behavior.
Ldw *A2, A1 ;load mem[A2] into A1; A1 contains the loaded value after 4 delay cycles
Add A1, A3, A4 ;add registers A1 & A3 and store the sum in A4.
Typically, delayed register writes are handled by use of delay queues to hold pending write-backs. Each cycle, the delay queue is advanced and ready write-backs committed. Delay queues are very expensive and generally impact simulation speed quite drastically.
4. Efficient support for register write-backs spanning basic blocks In some architectures (such as the TI DSP TMS320C6000), register write-backs may span basic blocks. A register write-back issued in one sequential code segment may have a long latency and take effect (i.e. update the destination register) in a different code segment. A simplistic solution is to revert to interpretation and use delay queues but this can be very expensive if such write-backs are encountered in inner-most loops of applications.
5. Accurate recognition of interrupts Support for interrupts is a necessary requirement in almost all simulation platforms since applications rely on interrupt functionality provided by agents such as timers, DMAs etc. Here, the main challenge is efficient recognition and handling of interrupts.
6. Support for self modifying code
7. Debugger support for interactive debug (run, step, halt, breakpoints, etc)
8. Low simulator construction effort
iv. PROPOSED FAST SIMULATION ENGINE ARCHITECTURE
We use dynamic code generation to achieve very high simulation speed. Its key features are:
- Dynamic host ANSI ‘C’ code generation
- Execution-based code generation
- Code generation at basic blocks (or larger) granularity to facilitate code optimization by host compiler
- Use of compilation thresholds to minimize external compilation costs at run-time
- Support for both interpretive and compiled simulation modes to allow flexible debug control as desired
- Support for delayed register writes, delayed branches, & register writes spanning basic blocks
- Cycle accurate interface to memory system
- Adaptors to retarget the simulation engine to different ISAs
Figure 1 depicts the top-level flowchart of simulator execution. Essentially, the simulation engine performs three key operations :
- Basic block identification
- Code generation & compilation
- Execution control
1. Basic block identification :
The engine identifies basic blocks of application code and also keeps track of their execution mode (described below) and execution count.
- Execution mode
- Pure: the execution mode of a basic block is said to be pure if the block can be translated to host code that does not require use of delay queues for register updates
- Mixed: the execution mode is said to be mixed if the block can be translated but with use of delay queues to hold pending register writes
- Interpreted: the mode is said to interpreted if the block can not (should not) be translated to host code. This mode is used if breakpoints have been inserted at location(s) in the block.
- Execution count
- The block’s execution count is incremented and compared with the compilation trigger threshold each time the block is interpreted. If the block execution count exceeds the trigger, then it kicks off a host code generation and compilation step.
2. Code generation & compilation :
In this operation, the engine emits host ‘C’ code, and executes an external host compilation process when compilation threshold criteria are satisfied. The most interesting aspect of this operation is the method of generating optimized host ‘C’ code, which we term ‘execution-based code generation’, described next.
- Execution-based host ‘C’ code generation: In this unique method, the semantic functions associated with individual instructions in the basic block are executed to generate host code. That is, the semantic functions are not statically parsed or analyzed, but actually executed in order to generate host ‘C’ code. The method involves replacing the native ‘C’ types (bool, char, short, int, float, double etc) and control operations (if, else, for, while, etc) in the semantic functions with C++ types and control handler functions respectively. This transformation to the semantic functions is easy to do and facilitated by use of macros. When the transformed functions are called during simulation, the C++ types and control handler functions emit host code. The code snippet below indicates the idea:
Original Code In semantic function | Transformed code Via macros | Host ‘C’ code generated By executing transformed code |
int32_t a,b,c; a=100; b=200; c=a+b; | c_int32_t a,b,c; a=100; b=200; c=a+b; | int32_t a, b, c; a=100; b=200; c=300; |
In the above snippet, note that the native 32-bit integer type (int32_t) has been substituted by the user-defined C++ type c_int32_t. This C++ class implements all the standard operators such as +, -, *, >>, &, | and several others. For example, execution of the assignment ‘a=100;’ invokes the operator= of the c_int32_t class. The last statement ‘c=a+b;’ causes operator+ to be invoked on objects a and b. Since these objects hold constant values, the operator returns the sum (in the example, the value 300).
This example above illustrates a powerful idea: using the C++ types and control function handlers, it is possible to execute the semantic functions to generate optimized host code. The optimization occurs by virtue of generation time constants as illustrated by the replacement of the string ‘a+b’ by the string ‘300’. In practice, there are several such generation time constants encountered during simulation: condition codes that are always true or always false, immediate constants embedded in instructions, branch displacements, various instruction field values etc.
As the semantic functions in a basic block are executed, the generation process emits code, using a pool of function-local variables to hold temporary results as well as pending register write-back values.
Further, the generation process tracks which variables are ‘unlocked’ (i.e, the values they hold have been consumed or written to registers) & when, thereby allowing for efficient implementation of delayed register updates, & the ability to transfer state from locked local variables to the traditional delay queue when interrupts are detected. These techniques are described further in the next section.
- Host compilation of generated code: In this step, all candidate blocks (those whose execution counts exceed compilation threshold) are selected and compiled. Any available host compiler (eg: gcc, or MSVC on Windows) is used for this purpose; the output of the compilation is a dynamically loadable module (DLL).
- State update: Once compiled, the compiled code (DLL files) is loaded into the simulation cache. Subsequently, when the PC reaches the starting address of one of these blocks, the simulation cache calls the pre-compiled function associated with the corresponding basic block, thereby eliminating the overheads of interpretation.
3. Execution control :
The simulation engine interleaves the operations of basic block detection, host code generation and application code execution in such a way as to minimize the costs of basic block book-keeping and host compilation. Further, it chooses the appropriate execution mode (interpretation, purecompiled function call, or mixed-compiled function call) at each basic block depending on the state of the ISA and command from the debugger. For instance, a debugger ‘step one instruction’ command always causes interpretive mode to be used, whereas the ‘run free’ command causes compiled functions to be called predominantly except around PC’s with breakpoints.
The engine architecture is designed to work with ISAspecific adapters to customize the code generation process. The architecture lets ISA adapters perform instruction decoding, identification of block boundaries, identification of cycle or phase boundaries, etc. This permits the ISA independent parts (execution based generation, host compilation, basic blocks management, simulation cache management etc) to be independently and commonly implemented across multiple ISAs.
Figure 1: Simulation engine architecture
v. OVERCOMING THE CHALLENGES
As discussed in the previous section, dynamic compilation at basic block granularity completely eliminates interpreter overhead for commonly executed code, thereby achieving significantly high speeds. We have achieved retargetability of the framework on both counts:
- the engine is host independent by virtue of generating ANSI ‘C’ code. It can easily be ported to any host platform.
- by separating the ISA-dependent parts from the independent ones, the engine has been commonly implemented for all ISAs
Before discussing handling of delayed register writes, we present the structure of the generated code. As indicated previously, an entire basic block is compiled into a host function that simulates it. All temporary variables accessed by the semantic functions in the block are mapped to local variables declared in the host function. The code generator tracks allocation and release of the locals so that the available pool of variables is appropriately managed. This provides the basis for efficiently implementing delayed register writes: a value to be written to a register at a later cycle (or phase) is stored in a local variable which is locked for the duration of the intervening cycles (or phases). At the appropriate cycle, an assignment statement to assign the value of the locked variable to the register is generated. This scheme avoids use of costly delay queues entirely.
The above treatment of delayed register writes does not, however, work in case of register updates that span basic blocks since values can no longer be held in local variables. Further, since we can not statically determine the caller-callee relationships between basic blocks, we can not use global variables either. The only solution is to use delay queues. Note that we still compile basic blocks but use delay queues to hold register writes instead of local variables (mixed mode).
Interrupt handling raises additional issues: how do we ‘bail’ out of compiled code and honor the interrupt accurately? How do we manage delayed writes that are held in local variables? The solution that we use is one of rigorous book-keeping: at code generation time, the generator produces additional tracking information on a cycle basis - sets of locked local variables and their lock durations. During execution of compiled code, if an interrupt were to occur in the middle of the generated code, the engine uses this tracking information to fill in the delay queue with set of pending register writes and switch to interpreted mode.
Self-modifying code is handled by monitoring data-path writes that update the program memory. If such a write occurs, previously compiled blocks corresponding to updated program memory are annulled.
Debugger support is seamless owing to support for interpreted mode. Standard functions such as running, stepping, halting, setting or removing breakpoints, and monitoring profile events are all supported. Interactive commands use interpreted mode. Batch commands use compiled mode.
Simulator construction using the new technology has been made easy by virtue of reusing semantic functions via macro substitutions; decode functionality is reused as-is.
vi. RESULTS
We benchmarked the techniques outlined above by constructing a full-fledged instruction set simulator for the TI DSP TMS320C64x and comparing the performance of this simulator against an existing, fully interpretive simulator. The table below summarizes the benchmarking results. The simulators were run on an off-the-shelf desktop PC, powered by Intel Pentium IV processor clocked at 2.4GHz with 1GB primary memory, running Windows XP operating system. The simulator speed was calibrated in MIPS. Calibrations were performed by executing the benchmark applications on the simulators unobtrusively.
Matrix Mpy | Dot Product | RSEMC codec | MPEG2 codec | |
Simulator speed (MIPS) of prototype | 240 | 230 | 120 | 100 |
Simulator speed (MIPS) of interpretive simulator | 8 | 8 | 8 | 8 |
Factor of improvement | 30 | 28.75 | 15 | 12.5 |
As may be observed, the techniques described in this work have enabled significant speed up as compared to traditional interpretive techniques. We have observed speed up anywhere between 12.5 times to 30 times on measured benchmarks, with absolute speeds regularly in excess of 100 MIPS.
vii. CONCLUSIONS
Our paper has demonstrated an engine for very high speed instruction-set simulation using the dynamic source code generation & compilation technique.
Execution-based code generation is a novel way of reusing existing interpretive code-base while also optimizing the generated code by virtue of generationtime constants. This technique supports efficient simulation of ISA complexities such as delayed register writes, delayed branches, register writes spanning basic blocks and accurate interrupt recognition.
Our engine is retargetable to any host platform. By separating the ISA-specific functionality from the common engine core, the engine framework has been made for easy adaptation into different ISA simulators. Thus, we believe that these techniques hold promise of providing the necessary performance thrust to simulation tools to enable full-fledged embedded software development for complex SoC devices.
viii. REFERENCES
1. Vojin Zivojnovic, Steven Tjiang and Heinrich Meyr, “Compiled simulation of programmable DSP architectures”
2. Simulation/evaluation environment for a VLIW processor architecture - by J. H. Moreno, M. Moudgill, K. Ebcioglu, E. Altman, C. B. Hall, R. Miranda, S.-K. Chen, IBM Journal of research and development, Performance Analysis and its impact on design, Volume 41, No. 3, 1997
3. A dynamic binary translation approach to architectural simulation – by Harold W. Cain, Kevin M. Lepak, and Mikko H. Lipasti, Tech report, University of Wisconsin, Madison
4. Bob Cmelik, David Keppel, “Shade: A Fast Instruction-Set Simulator for Execution Profiling”
5. Wai Sum Mong, Jianwen Zhu, “DynamoSim: A Trace-based Dynamically Compiled Instruction Set Simulator”
6. Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia, “Dynamo: A Transparent Dynamic Optimization System”
7. Achim Nohl, Gunnar Braun, Oliver Schliebusch, Rainer Leupers, Heinrich Meyr, Andreas Hoffmann, “A Universal Technique for Fast and Flexible Instruction Set Architecture Simulation”
8. Mehrdad Reshadi, Prabhat Mishra, Nikil Dutt, “Instruction Set Compiled Simulation: A Technique for Fast and Flexible Instruction Set Simulation”
9. spru189.pdf, dspvillage.ti.com, Texas Instruments TMS320 C6000 DSP CPU & Instruction set Reference Guide
Related Semiconductor IP
- Root of Trust (RoT)
- Fixed Point Doppler Channel IP core
- Multi-protocol wireless plaform integrating Bluetooth Dual Mode, IEEE 802.15.4 (for Thread, Zigbee and Matter)
- Polyphase Video Scaler
- Compact, low-power, 8bit ADC on GF 22nm FDX
Related White Papers
- Connecting reality and simulation: Couple high speed FPGAs with your HDL simulation
- SoC silicon is first-time success following simulation and validation of novel array processor
- Integrating High Speed Serial Transceivers into an FPGA
- Multi-Gigabit SerDes: The Cornerstone of High Speed Serial Interconnects
Latest White Papers
- Monolithic 3D FPGAs Utilizing Back-End-of-Line Configuration Memories
- Reimagining AI Infrastructure: The Power of Converged Back-end Networks
- 40G UCIe IP Advantages for AI Applications
- Recent progress in spin-orbit torque magnetic random-access memory
- What is JESD204C? A quick glance at the standard