Processor Know Thyself: moving beyond on-chip JTAG emulation
By Dan Rinkes, Embedded.com
Aug 8 2005 (9:30 AM)
Experienced programmers know you can't have too many tools in your debug tool box. While instrumenting code with debug variables and patches can help locate a certain class of bug, for many bugs, you need hardware assistance.
However, gone are the days of placing a logic analyzer between two points in a system, because the entire system may be embedded in a single piece of silicon, while in-circuit emulators (ICE) are expensive and difficult to attach to small form factor devices. Now, more than ever, it has become essential that debug facilities exist on-chip.
Integrating debug peripherals on processors is not a new idea. JTAG emulation has allowed pseudo real-time access to system registers for some time. However, for even simple operations such as memory reads and writes, JTAG uses the processor under test to execute code. Such features are intrusive, tainting pipelines, caches, and other system components, which may mask or even stimulate errors.
Integrated embedded emulation peripherals do not replace JTAG emulation. If anything, they can enhance the visibility a JTAG port can offer. Instead of using the processor core to execute functions, peripherals execute in parallel to the processor with complete access to system registers, memory, and executive control, resulting in non-intrusive visibility, increased performance, lower latency, and greater complexity of functions.
Stealth Access
It is the finesse with which tools are used that determines their real value. For example, a real-time data exchange (RTDX) debug peripheral writes and reads without halting the processor, allowing you to watch registers and memory address ranges as application code executes. This feature can be useful in generating real-time errors or tracking program execution.
In the case of errors, you can manually corrupt data or code to force corner or extreme cases, or jumpstart a debugging session by preloading registers and memory with a profile that is known to cause the error under test. To change an entire data set simultaneously, you can group data in an object or record referenced by a pointer, make changes to a second instantiation of the object, then adjust the pointer to the second instance.
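The pointer-swap technique can be sketched in C as follows. This is a minimal illustration, not code from any particular device: the record type `control_params_t`, its fields, and the function names are all hypothetical, and the sketch assumes a pointer-sized write is atomic on the target so the application never sees a half-updated pointer.

```c
#include <stdint.h>

/* Hypothetical tuning record; the field names are illustrative only. */
typedef struct {
    int32_t gain;    /* percent */
    int32_t offset;
    int32_t limit;
} control_params_t;

/* Two instances: the application reads one while the debugger edits the other. */
static control_params_t param_sets[2] = {
    { 100, 0, 4095 },
    { 100, 0, 4095 },
};

/* The application always goes through this pointer. The pointer itself is
   volatile because it can change outside normal program flow. */
static control_params_t * volatile active_params = &param_sets[0];

/* Application side: snapshot the pointer once per computation so that every
   field used comes from the same instance. */
int32_t apply_control(int32_t sample)
{
    const control_params_t *p = active_params;
    int32_t out = (sample * p->gain) / 100 + p->offset;
    return (out > p->limit) ? p->limit : out;
}

/* What the debugger would do through the debug peripheral: fill the
   inactive instance, then redirect the pointer with a single write. */
void debug_swap_params(const control_params_t *new_values)
{
    control_params_t *spare =
        (active_params == &param_sets[0]) ? &param_sets[1] : &param_sets[0];
    *spare = *new_values;
    active_params = spare;
}
```

Because the application takes one snapshot of the pointer per computation, it sees either the old data set or the new one, never a mixture of the two.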
For advanced program execution tracking, you can watch the program counter or instrument the code to adjust debug variables that describe the current status of the application. Depending upon how the peripheral is implemented, reads and writes won't flush the cache. Compare this with setting a conventional breakpoint, where the instruction opcode is replaced with a breakpoint opcode. With some architectures, this will invalidate the entire cache. The resulting delay could throw off delicate system timing, for instance in a very tight loop that must execute within a certain interval to meet real-time deadlines.
Note that RTDX is not a real-time feature per se; the read/write is made after a person presses enter, not based on a precise trigger. What is important is that real-time events are not affected by the read/write. This is critical for applications servicing real-time events.
Debugging the Real World
Just because the processor halts doesn't mean the real world stops as well. Consider running a motor or accepting data over a network connection. You might miss a data event for the network connection and thus lose data you wouldn't have if you hadn't halted the system. Alternatively, any voltages driving motors will continue to drive them when you halt the processor. Certainly, if you lose data from a network connection or a sample from an audio source, you can always resend it. With a mechanical failure, however, you can damage your prototype.
Consider two standard ways of interacting with mechanical components: software and hardware interrupts. For example, you can use a regularly timed interrupt to drive the head on a hard drive. Each time the timer triggers, the appropriate interrupt evaluates the head position relative to its destination and adjusts the motor voltage accordingly. If you halt the processor before stopping the head, the head may crash into the platter.
An example of a time-critical hardware interrupt is an emergency stop button that triggers a hardware interrupt to shut down all motors, etc. When you halt a processor, this interrupt is no longer serviced; press emergency stop and nothing happens. If halting your application leaves motors running, you still have the potential for an emergency and the need to shut the system down quickly in the way that only the emergency stop interrupt can.
Ideally, you might want time-critical interrupts to continue executing even when you've halted the processor. This is often the case when you're debugging on reliable hardware. You know the mechanical motors or network connections work, but since they run on the processor, they are halted when you halt the application. This can make debugging your application an order of magnitude more difficult because you have to figure out how not to disrupt the real-time, interrupt-driven part of the system.
In these types of cases, the option to continue to execute interrupts even when the application is halted can be quite useful. This is achieved with an embedded emulation peripheral that masks time-critical interrupts (Figure One). Being able to mask interrupts is important because there may be certain application-based timer interrupts you don't want executing.
Debugging a drive head’s code execution sequence
Some timer interrupts will reset themselves after executing. This could result, for example, in the processor executing the next queued move for a drive head (see Figure Two, below). There are two task queues for moving the disk drive head. Queue A sequentially performs the following tasks: move to track 5 (black), read current track (black), move to track 27 (black), and then read current track (black). Queue B holds the following operations: unlock head (red), move head + (red), stop at track 5 (black), spin at reading speed (red), stop head (blue), and finally, spin at reading speed (blue).
For Queue A, halting the processor may cause the drive head to crash if the head is currently moving. If this interrupt is masked as a time critical interrupt, the interrupt will continue executing until the task is finished and the drive head is secure. If the drive head interrupt is driven by a timer, it will continue to execute tasks from the queue until the queue is empty. Because each task ends with the drive head secure, the task manager can safely halt after the completion of any task.
With more granular primitives, such as those in Queue B, tasks in red represent tasks where the drive head is left in jeopardy. The task manager must continue to execute tasks until the head is secure, represented by black tasks. If the application has failed to schedule all the primitives for a complex action, the last task in the queue will leave the drive head in jeopardy. The task manager must schedule a task which secures the drive head (shown in blue), then put the drive head back in its original condition when system execution is resumed.
Alternatively, you may not be simply interested in finishing out the current task in the motion queue. Consider the primitives into which you've broken down motion. If the application commands "Move to x", this command resolves to a safe position with the head stopped. However, if your primitives are more granular – start, move right, move faster, stop, etc. – finishing the current task may still leave your system in jeopardy. You'll want to modify your task queue manager to continue executing the task queue until the head is out of jeopardy. With data, you might want to finish reading in the current block or packet to avoid a resend.
To do this, you'll need to make the task manager aware of the time-critical interrupt mask. When determining whether to begin the next task, check if the system is in jeopardy. If it is, execute the task. If not, then you can halt the task queue. If the task queue is empty, the task manager must queue a task that removes the system from jeopardy. When the processor resumes execution, the task manager returns the system back to the condition it was in when the queue was empty.
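One way to structure such a jeopardy-aware task manager is sketched below. Everything here is an assumption made for illustration: the task kinds loosely follow Queue B, `tm_on_timer` stands in for the time-critical timer interrupt, and the `app_halted` flag stands in for whatever status the emulation peripheral exposes. A real handler would drive the motor where the comment indicates.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8

typedef enum {
    TASK_UNLOCK_HEAD,
    TASK_MOVE_HEAD,
    TASK_STOP_AT_TRACK,
    TASK_SECURE_HEAD            /* inserted by the manager itself */
} task_kind_t;

typedef struct {
    task_kind_t kind;
    bool leaves_in_jeopardy;    /* true if the head is unsafe afterwards */
} task_t;

typedef struct {
    task_t queue[QUEUE_CAPACITY];
    size_t head;
    size_t count;
    bool in_jeopardy;           /* state after the last executed task */
    bool secured_by_manager;    /* manager had to secure the head itself */
    task_kind_t last_kind;      /* last task executed, for inspection */
} task_manager_t;

void tm_init(task_manager_t *tm)
{
    tm->head = 0;
    tm->count = 0;
    tm->in_jeopardy = false;
    tm->secured_by_manager = false;
    tm->last_kind = TASK_SECURE_HEAD;
}

bool tm_push(task_manager_t *tm, task_kind_t kind, bool leaves_in_jeopardy)
{
    if (tm->count == QUEUE_CAPACITY) {
        return false;
    }
    size_t tail = (tm->head + tm->count) % QUEUE_CAPACITY;
    tm->queue[tail].kind = kind;
    tm->queue[tail].leaves_in_jeopardy = leaves_in_jeopardy;
    tm->count++;
    return true;
}

/* Timer interrupt body. Returns true if a task was executed. */
bool tm_on_timer(task_manager_t *tm, bool app_halted)
{
    if (app_halted && !tm->in_jeopardy) {
        return false;           /* head is secure: safe to stop here */
    }
    if (tm->count == 0) {
        if (!app_halted) {
            return false;       /* application is running and will queue more */
        }
        /* Halted, in jeopardy, and nothing queued: secure the head ourselves. */
        tm_push(tm, TASK_SECURE_HEAD, false);
        tm->secured_by_manager = true;
    }
    task_t t = tm->queue[tm->head];
    tm->head = (tm->head + 1) % QUEUE_CAPACITY;
    tm->count--;
    /* A real handler would drive the motor here. */
    tm->last_kind = t.kind;
    tm->in_jeopardy = t.leaves_in_jeopardy;
    return true;
}

/* Called when the application resumes: restore the pre-halt condition. */
void tm_on_resume(task_manager_t *tm)
{
    if (tm->secured_by_manager) {
        tm->in_jeopardy = true;
        tm->secured_by_manager = false;
    }
}
```

In this sketch the resume step only restores the bookkeeping. On real hardware it would also re-issue whatever motion the securing task interrupted.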
For some events, such as emergency stop, you probably won't need to modify your code unless the emergency stop doesn't itself resolve system jeopardy but instead sends a message to the application to take care of things.
Time-critical interrupt masking can also simplify hardware debugging. Consider a one second action. On a 100 MHz processor, you'll have to hunt down the small amount of real-time code interspersed among approximately 100 million cycles of application code. Using time-critical interrupt masking, you could freeze the task manager until you've queued up the tasks you want to debug.
You then mask for time-critical interrupts and release the task manager. The processor will be halted for the application but will still run the real-time code. Thus, all you'll have in the trace buffer is the real-time code that you want to debug. Of course, if the bug is caused by an unintended interaction between the application and interrupt, this technique won't reveal the problem. However, then you'll know it is not solely the interrupt at fault but rather an unintended interaction.
Complex Triggering
Another important peripheral is one that provides comparators to enable features such as advanced event triggering. Traditionally, to set a breakpoint, a debugger overwrites the instruction on which to halt with a breakpoint instruction. For simple breakpoints, this technique is sufficient. However, if the application code is burned in ROM and cannot be modified, or you want to execute a complex breakpoint, the breakpoint command is not enough.
For example, if you want to break on the 10078th execution of a line of code because that is when the error occurs, your code will halt 10078 times, pausing to increment and test a counter each time. Such testing is intrusive and may affect real-time events. Even more intrusive is a watchpoint, where you want to break when a particular register or memory address is modified within a particular value range. By having comparators that execute in parallel to the processor, complex breakpoints no longer halt the processor or affect the pipeline with a breakpoint and interrupt debug code.
You can also create inverted breakpoints. Say you have a variable owned by a function being modified elsewhere in the code. If you merely set a watchpoint, you'll have to sift through all valid modifications to the variable. By defining an inverted range – any code outside the function rather than inside it – you'll narrow the number of modifications you have to personally evaluate, increasing your efficiency.
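The comparator logic behind a counted, inverted-range watchpoint can be modeled in software to make the conditions concrete. On a real device this evaluation happens in hardware, in parallel to the core; the structure and function names below are hypothetical and do not correspond to any specific peripheral's registers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t watch_addr;     /* data address being watched */
    uint32_t owner_start;    /* first code address of the owning function */
    uint32_t owner_end;      /* one past its last code address */
    uint32_t trigger_count;  /* fire on this many qualifying writes */
    uint32_t matches;        /* qualifying writes seen so far */
} comparator_t;

/* Evaluated for every observed write. Returns true when the halt should fire:
   the write hits the watched address, comes from code OUTSIDE the owning
   function (the inverted range), and is the Nth such write. */
bool comparator_on_write(comparator_t *c, uint32_t pc, uint32_t addr)
{
    bool addr_match = (addr == c->watch_addr);
    bool outside_owner = (pc < c->owner_start) || (pc >= c->owner_end);

    if (addr_match && outside_owner) {
        c->matches++;
        return c->matches == c->trigger_count;
    }
    return false;
}
```

Setting `trigger_count` to 10078 would model the counted breakpoint described earlier: the processor halts once, on the occurrence of interest, rather than 10078 times.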
Multiprocessor debugging
Debugging becomes more complex when you introduce multiple processors. A peripheral that allows you to monitor bus activity between two processors, such as in an MCU+DSP device, can resolve shared memory contention issues. In traditional debug environments, you can only see what was written to a memory location, not which processor made the write. A bus monitoring peripheral tracks the source of each memory access, providing the necessary information for the debugging environment to identify which processor made the write. This increased visibility adds complexity, so you'll need a debugger that can interleave the trace buffers between processors.
If a device supports bus monitoring, it probably also supports global breakpoints. With standard breakpoints, one processor can halt another processor only after a latency of several cycles. The processors are then out of sync in regard to interprocessor communication, potentially aggravating debugging by requiring you to reset both processors, and your application, to resync. Global breakpoints halt both processors on the same cycle.
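As a rough illustration of what interleaving involves on the debugger side, the sketch below merges two per-processor write traces into one timeline and then asks which processor last wrote a given address. It assumes, hypothetically, that each trace is already sorted and that both are stamped from a common cycle counter; the `bus_event_t` layout and the function names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SRC_MCU = 0, SRC_DSP = 1 } bus_source_t;

typedef struct {
    uint64_t cycle;        /* shared cycle counter at the time of the write */
    uint32_t addr;
    uint32_t value;
    bus_source_t source;   /* which processor issued the write */
} bus_event_t;

/* Merge two cycle-sorted traces into out (which must hold na + nb entries).
   On a tie, the event from trace a is placed first. Returns the total count. */
size_t interleave_traces(const bus_event_t *a, size_t na,
                         const bus_event_t *b, size_t nb,
                         bus_event_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        out[k++] = (b[j].cycle < a[i].cycle) ? b[j++] : a[i++];
    }
    while (i < na) {
        out[k++] = a[i++];
    }
    while (j < nb) {
        out[k++] = b[j++];
    }
    return k;
}

/* Scan the merged trace backwards for the most recent write to addr. */
bool last_writer(const bus_event_t *trace, size_t n,
                 uint32_t addr, bus_source_t *who)
{
    for (size_t i = n; i > 0; i--) {
        if (trace[i - 1].addr == addr) {
            *who = trace[i - 1].source;
            return true;
        }
    }
    return false;
}
```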
Hunting errant code with self-trace
A key issue for developers is hunting down errant code that causes execution to "run into the weeds", so to speak. Often you can send trace data on an emulation port to a PC for offline evaluation, but given the high processor speeds, you may need to filter what you send since you can send only a limited amount of data per clock cycle. Often the information you need either was not collected or was pushed out the back of the buffer if the buffer is not "infinite" (i.e., a storage device). To find your bug, you need a specialized trace.
Specialized trace peripherals buffer certain types of useful information. For example, a discontinuity trace will track the most recent branches, as well as provide an accurate measure of the number of cycles actually used, reflecting cache and pipeline efficiency. Tracking the gross movements of the program counter enables you to trace code execution using much less information than a full instruction trace requires. If you find the program counter in a place it shouldn't be, you can see where the code veered off.
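A discontinuity buffer behaves like a small ring buffer of branch records. The model below is a software sketch only: the depth of four entries and the `from_pc`/`to_pc` record layout are assumptions chosen for brevity, and a real peripheral would capture these records in hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define DISC_DEPTH 4   /* hypothetical depth; real buffers vary by device */

typedef struct {
    uint32_t from_pc;  /* address of the branch instruction */
    uint32_t to_pc;    /* address execution continued at */
} discontinuity_t;

typedef struct {
    discontinuity_t entries[DISC_DEPTH];
    size_t next;       /* slot the next record will overwrite */
    size_t stored;     /* number of valid entries, at most DISC_DEPTH */
} disc_trace_t;

/* Record one program-counter discontinuity, overwriting the oldest entry
   once the buffer is full. */
void disc_record(disc_trace_t *t, uint32_t from_pc, uint32_t to_pc)
{
    t->entries[t->next].from_pc = from_pc;
    t->entries[t->next].to_pc = to_pc;
    t->next = (t->next + 1) % DISC_DEPTH;
    if (t->stored < DISC_DEPTH) {
        t->stored++;
    }
}

/* Look back through the buffer: age 0 is the most recent branch.
   Returns NULL if fewer than age + 1 branches are stored. */
const discontinuity_t *disc_recent(const disc_trace_t *t, size_t age)
{
    if (age >= t->stored) {
        return NULL;
    }
    size_t idx = (t->next + DISC_DEPTH - 1 - age) % DISC_DEPTH;
    return &t->entries[idx];
}
```

Walking `disc_recent` from age 0 upward retraces the path the program counter took, which is how you would find the point where the code veered off.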
Another useful technique is tracking jumps to un-initialized memory. First, write NOPs throughout un-initialized blocks of memory. Set the final instruction word as a breakpoint. In this way, a branch to any part of un-initialized memory will fall through to the breakpoint. You can then look back through the discontinuity buffer to discover the errant jump. Consider leaving this capability enabled in deployed devices. When the breakpoint is executed, write the specialized trace buffers to non-volatile memory, as well as any important system variables. You now have a record of invaluable debug information for hunting down intermittent bugs.
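Preparing an un-initialized region this way might look like the following sketch. The opcode values are placeholders, not real encodings; substitute the NOP and breakpoint instruction words for your architecture. The function name is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings: assumptions for illustration, not real opcodes. */
#define OPCODE_NOP        0x00000000u
#define OPCODE_BREAKPOINT 0xDEADBEEFu

/* Fill an unused region with NOPs and end it with a breakpoint, so that a
   stray branch landing anywhere in the region slides down to the trap. */
void fill_trap_region(uint32_t *region, size_t words)
{
    if (words == 0) {
        return;
    }
    for (size_t i = 0; i + 1 < words; i++) {
        region[i] = OPCODE_NOP;
    }
    region[words - 1] = OPCODE_BREAKPOINT;
}
```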
The new generation of processors brings new performance capabilities which make debugging that much harder. To address these new barriers, processor manufacturers have been adding parallel debug capabilities to devices, enabling a new class of debugging techniques that promise to help developers get home on time.
Dan Rinkes is a software systems engineer at Texas Instruments, Inc.