In embedded, CPU progress is evolutionary

Ron Wilson
(05/17/2004 9:00 AM EDT)

  SAN MATEO, Calif. — In the processor world, the major architectural innovations happen in workstation CPUs, and embedded processors follow from a great distance. Evolution, not revolution, marks the progress of embedded CPUs. And the Embedded Processor Forum in San Jose today will see more than its share of gradual change.

The name of the game in the embedded world is to pick and choose among the innovations pioneered in the workstation world, find the ones that most nearly fit the needs of embedded design and adapt them to this entirely different environment. This year, the needs most in evidence are the twin struggles to control power and to push integration.

A look at three processor designs that will be described at EPF, all intended to be synthesizable cores, shows what variety can be achieved from these two driving requirements.

ARC International plc (San Jose, Calif.) will reveal details of the ARC 700 core officially unveiled in February. The processor design continues ARC's commitment to instruction-set expansion and data path configurability while at the same time pursuing both high performance and a more central role in embedded designs.

In the pursuit of integration, said ARC vice president of technical marketing David Fritz, system-on-chip designers are looking squarely at the proliferation of processors on their dice. Often, these processors have appeared by accretion: as new blocks have been integrated into the ever-expanding SoC, they have not been redesigned, but simply incorporated whole, right along with their embedded-processor cores. This has led to SoC architectures that contain a scattering of sometimes a dozen CPU cores, each one with a fixed application and more headroom than it needs for its specific tasks.

By combining these tasks into a single multitasking CPU core of sufficient performance, Fritz said, designers see a way to reduce transistor count and power significantly. But that usually means the central processor must support a multitasking real-time operating system to make the combining of the software tasks tractable.

Taking a similar measure of licensee needs, ARM Ltd. (Cambridge, England) will introduce a different solution to the same issues. The company will describe a configurable block that can contain from one to four ARM-1136-derived CPU cores in a multiprocessing configuration. The cores are surrounded by the buses, critical control paths, memories and logic necessary to support either symmetric or asymmetric multiprocessing operation.

This gives SoC designers, for the first time, a way to implement a whole multiprocessor block as practically a black box, said Dave Steer, ARM's director of segments for North America. The cores and other blocks are supplied in synthesizable form, but by providing the whole thing in a package, with configuration switches to control user-exposed parameters like L1 cache size, ARM believes it has encapsulated most of the things that can go wrong in a multiprocessor design so that licensees don't have to deal with them. A clear idea of design needs, sensible floor planning, fast memory and ARM's synthesis directives should carry the day, Steer suggested.

The block may contain up to four ARM-11-class cores. Each may come with ARM's vector-floating-point unit, and each has its own four-way set-associative L1 instruction and data caches. The size may be set by the user, but for symmetric multiprocessing all four pairs of L1s must be the same size. The cores are served by an array of inside-the-block hardware.
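The configuration rule described here can be sketched as a small validity check. This is only an illustration in Python: the names `CoreConfig`, `BlockConfig` and `validate` are invented for the example and are not ARM's configuration interface, and the sketch assumes that "same size" means every core's instruction/data cache pair matches every other core's.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CoreConfig:
    # Hypothetical per-core parameters; names are illustrative only.
    l1_icache_kb: int
    l1_dcache_kb: int
    has_vfp: bool = False  # optional vector-floating-point unit


@dataclass
class BlockConfig:
    cores: List[CoreConfig]
    symmetric: bool  # True for SMP, False for asymmetric operation


def validate(block: BlockConfig) -> List[str]:
    """Return a list of rule violations (empty if the config is acceptable)."""
    errors = []
    if not 1 <= len(block.cores) <= 4:
        errors.append("block must contain one to four cores")
    if block.symmetric:
        # Assumption: each core's (I-cache, D-cache) pair must match the others.
        sizes = {(c.l1_icache_kb, c.l1_dcache_kb) for c in block.cores}
        if len(sizes) > 1:
            errors.append("SMP requires identical L1 sizes on every core")
    return errors
```

Under these assumptions, a four-core block with matching caches passes, while mixing 16-Kbyte and 32-Kbyte caches is accepted only when the block is configured for asymmetric operation.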

Power comes in for serious treatment as well. ARM has applied its Intelligent Energy Management architecture to the cores, allowing dynamic adjustment of both operating frequency and voltage to meet task deadlines. One limitation is that, to keep the problem within reason, the same voltage and frequency must be applied to all the CPUs in the block at once. ARM estimates that the maximum frequency for the cores will be about 550 MHz in a process such as TSMC 130-nanometer LV.
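To see what the shared-setting limitation implies, consider a minimal Python sketch of deadline-driven frequency selection. It is not ARM's Intelligent Energy Management algorithm. The table of operating points is hypothetical (only the 550-MHz ceiling comes from ARM's estimate; the voltages and lower frequencies are invented), and the power figure relies on the standard approximation that dynamic power scales with voltage squared times frequency.

```python
from typing import List, Optional, Tuple

# Hypothetical (frequency in MHz, voltage in V) operating points, sorted by
# frequency. Only the 550 MHz ceiling comes from the article; the rest are
# made-up values for illustration.
OPERATING_POINTS: List[Tuple[float, float]] = [
    (137.5, 0.8),
    (275.0, 0.9),
    (412.5, 1.0),
    (550.0, 1.1),
]


def required_mhz(cycles: float, deadline_s: float) -> float:
    """Minimum clock rate (MHz) that finishes `cycles` within `deadline_s`."""
    return cycles / deadline_s / 1e6


def pick_shared_point(
    per_cpu_work: List[Tuple[float, float]],
) -> Optional[Tuple[float, float]]:
    """Pick the slowest operating point that meets every CPU's deadline.

    `per_cpu_work` holds one (cycles, deadline_s) pair per CPU. Because one
    voltage and frequency apply to the whole block, the most demanding CPU
    sets the pace for all of them.
    """
    need = max(required_mhz(c, d) for c, d in per_cpu_work)
    for freq, volt in OPERATING_POINTS:
        if freq >= need:
            return freq, volt
    return None  # no point is fast enough; the deadline cannot be met


def relative_dynamic_power(freq_mhz: float, volt: float) -> float:
    """Dynamic power relative to the fastest point, using P ~ V^2 * f."""
    top_f, top_v = OPERATING_POINTS[-1]
    return (volt ** 2 * freq_mhz) / (top_v ** 2 * top_f)
```

In this model a lightly loaded CPU cannot drop to a lower setting on its own; it runs at whatever point the busiest CPU in the block requires.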

Power reduction is also a theme for Tensilica Inc. The company will unveil its LX processor at the conference, showing two lines of thinking aimed at sharply reducing dynamic power consumption. The first is extensive use of gated clocks in both the base processor module and the user-defined execution units. The company said that in the processor compiled for Berkeley Design Technology Inc.'s benchmark suite, for example, there are 431 gated clock regions. Gating is controlled by demand-driven algorithms, so that a circuit is clocked only when it is required. While ARM has used the technique extensively for some time, it is just appearing in configurable processors.
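The effect of demand-driven gating can be illustrated with a toy count of clocked region-cycles. This Python sketch is not Tensilica's gating logic. The function name and the idea of a per-cycle "demand" trace are assumptions made for the example, and the model ignores the power cost of the gating circuitry itself.

```python
from typing import Iterable, Set


def clocked_region_cycles(
    demand: Iterable[Set[int]], num_regions: int, gated: bool
) -> int:
    """Count region-cycles that receive a clock edge.

    `demand` yields, for each cycle, the set of region indices whose logic
    is actually needed on that cycle. Without gating every region is clocked
    on every cycle; with gating only the demanded regions are.
    """
    total = 0
    for needed in demand:
        total += len(needed) if gated else num_regions
    return total
```

For a four-cycle trace over four regions in which the demanded sets are `{0, 1}`, `{0}`, empty and `{2}`, the ungated count is 16 region-cycles and the gated count is 4. With hundreds of regions, such as the 431 cited above, the gap depends entirely on how often each region is actually needed.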

Another power problem is addressed by a second major idea in the LX. Tensilica has preached that small, efficient processors can take over the work of dedicated hardware blocks, simplifying hardware design — just drop in another processor instance instead of designing a big functional block in RTL — and often coming near the efficiency of dedicated hardware.

But a stumbling block for this argument is inherent in stored-program computers. While the dedicated hardware just goes through its predetermined sequence of states, the CPU core has to fetch an instruction on every cycle, toggling a number of buses, clocking a great deal of hardware and generally burning energy. Worse, the processor has to explicitly load and store each word of data on which it operates, burning even more energy.

This latter problem has been attacked in an interesting way. In the LX, Tensilica has for the first time provided a FIFO-buffered port directly in and out of the execution pipeline, bypassing the register file, the instruction register and the need for load or store instructions. Those paths, which are implemented in the CPU when the user defines instructions that directly sink or source external data, give the ideal combination of direct access to the execution pipeline and the safety of a Tensilica-generated, buffered port. The ports simply appear as native registers to the user-defined instructions that employ them. As an additional measure, Tensilica has added an optional second load/store unit to the LX, giving it the ability to extract higher bandwidth from its caches.
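A rough way to picture the saving is to count instructions issued per data word on each path. The Python model below is a sketch, not a description of the LX instruction set. The function names are invented, the three-to-one ratio is a modeling assumption (one load, one operation and one store against a single port-based instruction), and it ignores loop overhead, address updates and stalls when a FIFO is empty or full.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def run_load_store(
    memory_in: List[int], op: Callable[[int], int]
) -> Tuple[List[int], int]:
    """Conventional path: explicit load, operate, store for every word."""
    out: List[int] = []
    issued = 0
    for word in memory_in:
        reg = word       # load instruction
        reg = op(reg)    # compute instruction
        out.append(reg)  # store instruction
        issued += 3
    return out, issued


def run_fifo_port(
    in_q: Deque[int], out_q: Deque[int], op: Callable[[int], int]
) -> int:
    """Port path: one instruction pops the input FIFO and pushes the output."""
    issued = 0
    while in_q:
        out_q.append(op(in_q.popleft()))
        issued += 1
    return issued
```

Both functions produce the same results for the same input stream; the difference in this model is only in how many instructions must be fetched and executed to get there.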
