Customized DSP -> VLIW calls for special debugging
By Mohammad Ayub Khan, Vice President, Software and Systems Engineering, TriMedia Technologies Inc., Milpitas, Calif., EE Times
March 12, 2001 (1:41 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010312S0091
Debugging applications on a multimedia processor built around a very long instruction word (VLIW) DSP-CPU is a major challenge because the system designer is exploiting instruction-level parallelism (ILP). Instruction-level parallelism is the parallel execution of several instructions on a processor and is realized through multiple instruction issue. Debugging at this level is unlike debugging on conventional architectures. On top of this, the debugger must support both hosted and standalone environments, since the application runs on an embedded processor. In standalone mode, the debugger has to operate over JTAG, USB or parallel ports, for example. It must also run on different host platforms, such as Windows or Unix.
The granularity at which the designer can debug is a function of the compiler's internal representation. Take a decision tree, for instance. It can be one statement, a group of statements or an entire function. If the compiler optimizes a particular function very aggressively, the entire function can become a single decision tree.
Further complicating debugging, the VLIW processor runs concurrently with such peripherals as a variable-length decoder (VLD), an image co-processor (ICP), a synchronous serial interface (SSI), a video input (VI) unit and a video output unit. The overall debugging process in this case therefore encompasses the entire application, including the DSP-CPU core and the peripherals used by the application. Ideally, then, the designer wants seamless access to the peripherals. With that access, he or she can efficiently and directly control such operations as video in, audio out, SSI, ICP, VLD and PCI, and in doing so monitor and change the behavior of those peripherals in real time.
A VLIW multimedia processor such as the TriMedia microprocessor issues five operations every clock cycle.
Two of them could be custom operations, which in turn could result in multiple operations issued to functional units. The set of operations issued in parallel is determined at compile time, not while the program is executing. As a result, the VLIW processor has considerably less control logic than a superscalar processor and so runs at higher clock speeds.
The microprocessor also has 128 general-purpose registers and four special-purpose registers: destination PC (DPC), source PC (SPC), clock count (CCount) and program control and status word (PCSW). The SPC and DPC registers support exception processing. DPC is updated during every interruptible jump with the target address of that jump. If an exception is taken at an interruptible jump, the exception-handling routine can use the value in DPC as the return address to resume the program at the point of interruption. SPC is updated during every interruptible jump that is not interrupted by an exception. Thus, on an interrupted interruptible jump, SPC is not updated. The SPC register therefore lets the exception-handling routine determine the start address of the decision tree that was executing when the exception was taken.
The debugger has two components: a host side and a target side. Those two components are further divided into independent modules. The host-side debugger is, in principle, similar to a conventional single-machine debugger, with a few important differences. Because the debugger supports both standalone and hosted environments, it provides a consistent API to underlying low-level drivers such as PCI and JTAG.
The target side is where all similarities to conventional debuggers vanish. It consists of a debug monitor optionally hooked up to a real-time operating system (RTOS) monitor. The RTOS monitor hides the details of the underlying system, making the target side RTOS-independent.
The RTOS monitor provides the debug monitor with such services as starting and stopping individual tasks, saving and restoring context, and supplying data that represents the state of resources such as queues, semaphores and events.
From a communication viewpoint, the debugger's architecture is divided into three parts: the debug monitor, the debug front end and the communications module. The architecture includes hardware communication channels such as the PCI bus and the TriMedia system bus. There is also communication among software components, as well as with optional hardware/software components such as a JTAG interface card or module or a real-time operating system (RTOS). The communications module on the host side has three layers: the application-level protocol (AL), the transport-level protocol (TL) and the data-link-level protocol (DL).
The debug monitor provides such functions as setting and removing breakpoints in a program, stopping and continuing execution, and examining and changing instruction memory, data memory, registers, PCSW and other state. Using built-in hardware debugging support, the monitor realizes this functionality through a set of software routines running on the DSP-CPU core of the VLIW multimedia processor. The monitor routines run as interrupt service routines, which control a debugged task possibly, but not necessarily, through an RTOS. The monitor and the debug front end communicate through the communications module using an asynchronous protocol.
The front end of the VLIW processor's C compiler generates a parallel intermediate representation of a program known as decision trees, which are derived from basic blocks. A basic block is a sequence of instructions with no jumps into it except at the first instruction and no jumps out of it except at the last instruction. Basic blocks are connected by conditional or unconditional jumps. A decision tree is similar to a basic block in that it can be entered only at the beginning.
However, a decision tree can have multiple exits. Decision trees are larger than basic blocks and offer more potential for fine-grain parallelism and optimization. Control flow from one decision tree to another is handled by the scheduler using inter-decision-tree jump operations. The scheduler is free to rearrange the operations of a decision tree, subject only to preserving data dependencies and the ordering among loads and stores. Guarding of operations enables the scheduler to eliminate branches where possible and to group operations belonging to different branches as straight-line code.
As stated earlier, the VLIW processor in this case has a large number of pipelined functional units, a large register set and other processor state. Interrupting at arbitrary points would require saving a large amount of that state, which would affect the processor's critical path and degrade performance. For that reason, the hardware makes no provision for saving the state of the processor at arbitrary points. This means the computation within a decision tree cannot be interrupted safely. The VLIW processor thus responds to special events such as exceptions and interrupts only at decision-tree boundaries. The instruction scheduler marks those boundaries with interruptible jump operations. The granularity of decision trees can be made small through compiler options, both for debugging and for minimizing interrupt response times.
In the last cycle of execution of a tree-to-tree jump operation, the VLIW processor's state can be fully described by a subset of the general-purpose registers and the special-purpose registers PCSW, SPC and DPC. PCSW shows the status of various flags; SPC contains the start address of the decision tree in which the jump originated; and DPC contains the destination program counter, that is, the target of the jump operation.
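The SPC/DPC behavior described above can be sketched as a small model. This is only an illustration, not the hardware's actual logic: the names `SpecialRegs`, `interruptible_jump` and `HANDLER_ADDR` are hypothetical, and the sketch assumes that an uninterrupted jump loads SPC with the jump target, which is consistent with SPC holding the start address of the tree that is executing.

```python
from dataclasses import dataclass

HANDLER_ADDR = 0xFFFF0000  # hypothetical exception-handler entry point


@dataclass
class SpecialRegs:
    """Toy model of the two exception-support registers."""
    spc: int = 0  # start address of the decision tree currently executing
    dpc: int = 0  # target of the most recent interruptible jump


def interruptible_jump(regs: SpecialRegs, target: int, exception_pending: bool) -> int:
    """Model one interruptible jump and return the address executed next."""
    regs.dpc = target          # DPC is updated on every interruptible jump
    if exception_pending:
        # SPC is not updated, so it still names the tree that was interrupted.
        return HANDLER_ADDR
    regs.spc = target          # assumption: SPC takes the target of a completed jump
    return target


regs = SpecialRegs(spc=0x1000)
interruptible_jump(regs, 0x2000, exception_pending=False)
next_pc = interruptible_jump(regs, 0x3000, exception_pending=True)
# The handler can now read regs.spc to find the interrupted tree
# and resume the program at regs.dpc.
```

After the second call, DPC holds the resume address and SPC still points at the tree in which the interrupted jump originated, which is the pair of values the exception handler needs.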
Also worth noting: since the processor cannot be interrupted within a decision tree, single-stepping the processor's execution has the granularity of a decision tree. However, the system designer can transfer the processor's state to a simulator, which provides finer granularity, and single-step the program there.
Debug front end
The communications module specifies the interaction between the debug front end and the debug monitor. The debug monitor runs on the target system, which can be a TriMedia board plugged into a PCI slot, a standalone system or a TriMedia machine simulator. The debug front end, which interacts with the designer or programmer, runs on a host system such as a PC or a Unix workstation.
The communications module is specified in a number of layers along the lines of an abridged Open Systems Interconnection (OSI) model. The main reason for the layered approach is that the application-level and transport-level protocols can be designed and implemented independently of the actual physical link between the target and the host. The data link layer may change depending on whether access is through the PCI bus or JTAG.
The application-level protocol specifies the services that the debug monitor running on the target processor provides to the debug front end running on a host processor. This protocol is implemented through calls to the next layer down, the transport layer. The transport layer provides reliable data transfer between the host and the target. It breaks messages into packets, then transfers and reassembles them. The actual data transfer occurs at the data link layer, which interfaces with the data communication hardware.
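The transport layer's job of breaking a message into packets and reassembling it can be sketched as follows. The article does not give the packet format, so the four-byte header (sequence number and packet count), the 64-byte payload limit and the function names are all assumptions made for illustration. The sketch also leaves out the acknowledgment and retransmission that reliable transfer would need.

```python
import struct

HEADER = struct.Struct(">HH")  # hypothetical header: sequence number, total packet count
MAX_PAYLOAD = 64               # hypothetical payload limit per packet, in bytes


def packetize(message: bytes, max_payload: int = MAX_PAYLOAD) -> list[bytes]:
    """Split a message into numbered packets."""
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)] or [b""]
    total = len(chunks)
    return [HEADER.pack(seq, total) + chunk for seq, chunk in enumerate(chunks)]


def reassemble(packets: list[bytes]) -> bytes:
    """Rebuild the message from packets that may arrive in any order."""
    parts: dict[int, bytes] = {}
    total = None
    for packet in packets:
        seq, total = HEADER.unpack_from(packet)
        parts[seq] = packet[HEADER.size:]
    if total is None or set(parts) != set(range(total)):
        raise ValueError("missing packets")
    return b"".join(parts[seq] for seq in range(total))


message = bytes(range(200))
packets = packetize(message)                      # four packets: 64 + 64 + 64 + 8 bytes
restored = reassemble(list(reversed(packets)))    # arrival order does not matter
```

Because the packets carry their own sequence numbers, the same two functions would work unchanged whether the data link underneath is PCI or JTAG, which is the point of the layering.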
Breakpoints can be set only at decision-tree boundaries, not at arbitrary statements. The execution of a decision tree can be viewed as an atomic operation. Debugging thus becomes significantly more challenging because of the number of concurrent operations. The debugger designer also has to contend with the 27 functional units of the VLIW processor, each of which has its own state, and the overall state of the machine must be maintained.
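One way a debugger front end might honor this restriction is to move a requested breakpoint to the start of the decision tree that contains it. The sketch below assumes the compiler's debug information yields a sorted list of decision-tree start addresses; that list, the addresses and the function name `snap_breakpoint` are hypothetical, and the article does not say which policy the TriMedia debugger actually uses.

```python
from bisect import bisect_right


def snap_breakpoint(tree_starts: list[int], requested: int) -> int:
    """Return the start address of the decision tree containing `requested`.

    `tree_starts` must be sorted in ascending order.
    """
    index = bisect_right(tree_starts, requested) - 1
    if index < 0:
        raise ValueError("address precedes the first decision tree")
    return tree_starts[index]


tree_starts = [0x100, 0x140, 0x200]        # made-up boundary addresses
actual = snap_breakpoint(tree_starts, 0x150)   # lands on the tree starting at 0x140
```

The binary search keeps the lookup cheap even when aggressive optimization or small-granularity compiler options produce many trees.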
The instruction-scheduling phase of the C compiler converts the parallel intermediate format code into packed instructions ready for the assembler. Certain restrictions exist in the choice of the operations that can be packed into an instruction. For instance, in the VLIW processor CPU, no more than two load/store class operations can be packed together. The instruction scheduler works on one decision tree at a time.
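The packing restriction can be illustrated with a deliberately simplified packer. It uses the two limits the article gives (five operations per instruction and at most two load/store operations), but it assumes all operations are mutually independent and fills instructions greedily in program order. A real scheduler must also respect data dependencies and other slot restrictions, and the operation class names here are hypothetical.

```python
MAX_SLOTS = 5        # five operations issue per instruction, per the article
MAX_LOAD_STORE = 2   # at most two load/store operations per instruction
MEMORY_CLASSES = ("load", "store")


def fits(instruction: list[str], op_class: str) -> bool:
    """Check whether one more operation may join the instruction."""
    if len(instruction) >= MAX_SLOTS:
        return False
    memory_ops = sum(1 for c in instruction if c in MEMORY_CLASSES)
    return not (op_class in MEMORY_CLASSES and memory_ops >= MAX_LOAD_STORE)


def pack(op_classes: list[str]) -> list[list[str]]:
    """Greedily pack operations, in order, into legal instructions."""
    instructions: list[list[str]] = []
    current: list[str] = []
    for op_class in op_classes:
        if not fits(current, op_class):
            instructions.append(current)   # close the full instruction
            current = []
        current.append(op_class)
    if current:
        instructions.append(current)
    return instructions


ops = ["load", "load", "load", "alu", "alu", "alu", "alu", "store"]
packed = pack(ops)
# packed == [["load", "load"],
#            ["load", "alu", "alu", "alu", "alu"],
#            ["store"]]
```

The third load forces a new instruction even though three slots are still free, which shows how the load/store limit, rather than the issue width, can end up determining code density.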
The debug front end interacts with the user, loads the program to be debugged, accepts debug commands with symbolic names and addresses, translates symbolic names to machine addresses, calls on the debug monitor to carry out the commands and displays the process state. It interacts with the programmer through a simple command-line interface as well as a graphical user interface that offers customizable menus of debug commands; multiple windows for displaying processor state, instruction and data memory, and stack frames; selection and modification of instruction and data locations; and simultaneous display of source code, intermediate code and assembly code. It is important for designer and programmer alike to know that the processor's DSP-CPU core uses a compressed instruction format in which the bits corresponding to an instruction may be scattered within a 256-bit range. Modifying an instruction therefore generally requires some decompress/compress functionality in the target-specific part of the debugger front end.