Custom processors rev Java execution
By Ashish Sethi, Product Manager, ARC International, and Matt Kubiczek, CTO and Cofounder, Digital Communications Technologies, Cambridge, England, EE Times
April 1, 2002 (10:48 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020329S0016
Embedded Java implementations run up against performance problems because of the stack structures the language requires. Java implements a stack-based programming model in which two stacks are required: the variable stack, which holds parameters passed between Java functions, and the operand stack, which holds parameters required by the method currently being executed.
When programming in C, stacks are normally maintained in main memory, and data is pushed or popped between the stack and the CPU registers. In standard software implementations of a Java Virtual Machine (JVM), both Java stacks are also maintained in main memory. But many more memory transactions are required, because data must be pushed and popped between the variable stack and the operand stack, and again between the operand stack and the CPU registers. This additional memory traffic can severely hamper system performance.
Stack operations are an example of tasks that dedicated Java processors do very well, but that RISC processors handle inefficiently. Also, translating Java byte codes into the native machine language of a processor must be done efficiently, both to reduce the size of the resulting native code and to reduce the complexity of the just-in-time (JIT) compiler. This is a significant challenge because RISC and Java processor architectures are vastly different. The fact that some Java byte codes are complex and correspond to many RISC instructions only exacerbates the performance loss.
Two routes have come to light that address the inefficiencies and challenges that developers face in accelerating Java execution in embedded systems.
The first approach attacks the translation problem. Some vendors have decided that an additional stage preceding the processor's instruction fetch can translate byte codes into RISC instructions as they arrive. This approach resembles a coprocessor. However, because it precedes the processor and does not hoard bandwidth on the bus, it does not hamper system performance as much as a coprocessor.
Preprocessing concerns
Indeed, it could be called a Java preprocessor. Since the hardware translation block replaces the software translation in the JVM, Java performance is improved. Developers can expect between 5 and 6 CaffeineMarks/MHz from a technique such as this one.
Although performance is enhanced and the cost of the solution, at approximately 12k gates, may sound appealing, there are hidden drawbacks in implementing a hardware translation mechanism. Complex Java byte codes are normally translated in such schemes by trapping the incoming byte code and executing a routine written in the native machine language of the RISC processor. The dynamic translation itself requires on-chip memory, similar to the microprogrammed ROMs in CISC processors. This ROM can be quite large, so the 12k-gate figure may give a deceptive view of actual silicon area usage. In addition, the processor must switch from translating byte codes to running a RISC function.
A more efficient, highly integrated implementation can be realized if a processor architecture that has been designed for extension is used. Such processors allow designers to add custom logic to the processor core architecture, such as new CPU instructions, supplementary CPU registers and control registers. This is the approach that we have taken, by extending the CPU register file of the ARCtangent-A4 user-customizable processor and developing a twofold algorithm that leverages the hardware enhancements to create a high-performance RISC/Java processor at little cost.
This approach does not focus on implementing bolt-on hardware that translates incoming byte codes as they are fetched. Instead, it strives to adapt the RISC processor architecture and programming model to better reflect the needs of Java. In that way, translation can be accomplished easily and efficiently in software.
Because no large preprocessor is required, a large operand stack can be implemented instead of a small one. This operand stack is implemented as CPU registers and is accessed using additional CPU register addresses that are not used by the standard RISC processor. Because the stack is larger than in the preprocessor approach, the chances of overflowing to memory are greatly reduced and performance is improved. Moreover, the variable stack can be combined with the operand stack to create a Unified Register Mapped (URM) stack. This greatly reduces the number of memory accesses required, because data is already present in registers ready to be processed rather than having to be loaded from main memory.
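The difference in memory traffic can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, not a model of any real JVM or of the ARC hardware: the class CountingMemory, the two functions and the base addresses are hypothetical names invented here. It counts main-memory accesses for the Java statement c = a + b (byte codes iload, iload, iadd, istore), first with both stacks held in memory and then with variables and operands sharing one register file.

```python
class CountingMemory:
    """Word-addressed main memory that counts every read and write."""

    def __init__(self):
        self.cells = {}
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value


VAR_BASE = 0     # hypothetical address of the variable stack frame
OPER_BASE = 100  # hypothetical address of the operand stack


def add_with_memory_stacks(mem):
    """c = a + b with both Java stacks in main memory.

    Returns the number of memory accesses performed.
    """
    start = mem.accesses
    sp = 0
    # iload a, iload b: copy each variable onto the operand stack
    for slot in (0, 1):
        mem.write(OPER_BASE + sp, mem.read(VAR_BASE + slot))
        sp += 1
    # iadd: pop both operands into CPU registers, add, push the result
    sp -= 1
    rhs = mem.read(OPER_BASE + sp)
    sp -= 1
    lhs = mem.read(OPER_BASE + sp)
    mem.write(OPER_BASE + sp, lhs + rhs)
    sp += 1
    # istore c: move the result from the operand stack to the variable stack
    sp -= 1
    mem.write(VAR_BASE + 2, mem.read(OPER_BASE + sp))
    return mem.accesses - start


def add_with_register_stack(regs, mem):
    """Same statement when a, b and c live in one register file.

    Main memory is never touched, so the returned count is zero.
    """
    start = mem.accesses
    regs[2] = regs[0] + regs[1]  # a single register-to-register add
    return mem.accesses - start


mem = CountingMemory()
mem.cells[VAR_BASE] = 2      # a
mem.cells[VAR_BASE + 1] = 3  # b
print(add_with_memory_stacks(mem))            # 9
print(add_with_register_stack([2, 3, 0], mem))  # 0
```

In this toy model the memory-resident version needs nine memory accesses for a single addition, while the register-mapped version needs none. Real figures depend on caching, stack overflow to memory and the actual instruction set.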
All in one step
The addition of the URM stack is fundamental to accelerating Java execution. It allows RISC instructions to directly manipulate data on the stack by accessing registers instead of requiring completely separate stack operations. Data can be popped from the stack, operated on, and pushed back onto the stack in a single RISC instruction by referencing these registers.
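One way to picture a pop, operate and push sequence collapsing into one instruction is a register file in which certain register addresses have stack side effects. The following sketch assumes such a scheme purely for illustration. The register names POP and PUSH, the 16-entry depth and the class RegisterStackCpu are hypothetical and are not taken from the ARC design.

```python
from dataclasses import dataclass, field

POP = "pop"    # hypothetical register address: reading it pops the stack
PUSH = "push"  # hypothetical register address: writing it pushes onto the stack


@dataclass
class RegisterStackCpu:
    """Toy CPU whose operand stack is a bank of registers."""

    stack_regs: list = field(default_factory=lambda: [0] * 16)
    depth: int = 0         # number of live stack entries
    instructions: int = 0  # instructions executed so far

    def _read(self, reg):
        if reg == POP:
            if self.depth == 0:
                raise IndexError("stack underflow")
            self.depth -= 1
            return self.stack_regs[self.depth]
        return self.stack_regs[reg]  # plain indexed register

    def _write(self, reg, value):
        if reg == PUSH:
            if self.depth == len(self.stack_regs):
                raise IndexError("stack overflow")
            self.stack_regs[self.depth] = value
            self.depth += 1
        else:
            self.stack_regs[reg] = value

    def load_imm(self, dst, value):
        """dst <- constant, one instruction."""
        self.instructions += 1
        self._write(dst, value)

    def add(self, dst, src1, src2):
        """dst <- src1 + src2, one instruction even when operands pop and push."""
        self.instructions += 1
        a = self._read(src1)
        b = self._read(src2)
        self._write(dst, a + b)


cpu = RegisterStackCpu()
cpu.load_imm(PUSH, 2)
cpu.load_imm(PUSH, 3)
cpu.add(PUSH, POP, POP)  # pops 3 and 2, pushes 5, all in one instruction
print(cpu.stack_regs[0], cpu.depth, cpu.instructions)  # 5 1 3
```

The point of the model is only that the add counts as a single instruction. On a memory-resident stack the same byte code would need separate loads and a store around the arithmetic.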
The next step is to implement a translation scheme that efficiently utilizes the modifications made to the RISC architecture. Because the stacks are no longer held in memory, byte codes that push and pop data can be represented as RISC instructions that move data from register to register, rather than as costly "load" and "store" instructions.
In this way, single byte codes can be mapped to single RISC "move" instructions. Moreover, because the stack combines both operands and variables and is implemented in registers, there is no need to move data from one stack to another, as long as the software keeps track of where each value resides.
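The bookkeeping described here can be sketched as a tiny translator. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function translate, the register names v0, v1, ... (variables) and s0, s1, ... (operand slots), and the three-operand "add" and "mov" mnemonics are all hypothetical. The translator keeps a translation-time model of which register holds each operand-stack entry, so an iload emits no instruction at all.

```python
def translate(bytecodes):
    """Translate a list of byte code tuples into RISC-like instruction strings.

    Supported byte codes: ("iload", n), ("istore", n), ("iadd",).
    """
    stack = []  # translation-time model: register name holding each operand
    risc = []   # emitted instructions
    for op, *args in bytecodes:
        if op == "iload":
            # No code emitted: just remember that the operand lives in v<n>.
            stack.append(f"v{args[0]}")
        elif op == "iadd":
            rhs = stack.pop()
            lhs = stack.pop()
            dst = f"s{len(stack)}"  # operand-slot register for the result
            risc.append(f"add {dst}, {lhs}, {rhs}")
            stack.append(dst)
        elif op == "istore":
            # One register-to-register move replaces a load/store pair.
            risc.append(f"mov v{args[0]}, {stack.pop()}")
        else:
            raise ValueError(f"unsupported byte code: {op}")
    return risc


# c = a + b, with a, b, c in variable slots 0, 1, 2
program = [("iload", 0), ("iload", 1), ("iadd",), ("istore", 2)]
for line in translate(program):
    print(line)
# add s0, v0, v1
# mov v2, s0
```

Four byte codes become two register instructions with no memory traffic. The sketch ignores cases a real translator must handle, such as a variable being overwritten between its iload and its use, stack-manipulating byte codes like dup and swap, and running out of registers.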