SoC architects look to programmability
By Ron Wilson, EE Times
September 16, 2002 (4:12 p.m. EST)
URL: http://www.eetimes.com/story/OEG20020916S0079
SAN MATEO, Calif. - Architects in communications, network processing and numerically intensive applications are moving away from the traditional system-on-chip composed of CPU cores, DSP cores and specialized hardware modules. Instead, they are moving toward software solutions, which is forcing designers to respond in very creative ways.
As MIPS Technologies Inc. vice president of product strategy Keith Diefendorff points out, the search for increased performance has shifted away from the pursuit of higher clock frequencies and toward the exploitation of parallelism. There are basically four places to look for parallelism. With instruction-level parallelism, one can execute several instructions at the same time because the instructions are independent of each other. Thread-level parallelism allows one to look for whole independent sequences of instructions that occur between programmed context switches. Process-level parallelism lets one identify whole tasks that can be performed in parallel. Alternatively, single-instruction, multiple-data, or SIMD, parallelism enables parallel execution by breaking the data up into independent segments and running the same program simultaneously on multiple sets of data. Each of those approaches has its own hardware implementation.
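The data-partitioning idea behind the last of those four approaches can be illustrated with a minimal software sketch. It is an analogy only, not SIMD hardware: the function names (`scale_and_offset`, `split`, `run_data_parallel`) and the lane count are assumptions made for illustration, and in standard CPython the worker threads interleave rather than run truly in parallel, so the sketch shows the structure of the technique, not its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_and_offset(segment):
    # The same "program" is applied to every element of one data segment.
    return [2 * x + 1 for x in segment]


def split(data, lanes):
    # Break the data into at most `lanes` contiguous, independent segments.
    size = max(1, -(-len(data) // lanes))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_data_parallel(data, lanes=4):
    segments = split(data, lanes)
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        # map() returns results in segment order, so they can be concatenated.
        results = list(pool.map(scale_and_offset, segments))
    return [y for segment in results for y in segment]


if __name__ == "__main__":
    print(run_data_parallel(list(range(10))))
```

Because the segments share no data, no synchronization is needed between the workers; that independence is what makes this form of parallelism easy to exploit.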
The earliest explicit approaches in the SoC world were superscalar, pipelined RISC architectures designed to exploit the low-hanging fruit in instruction-level parallelism. The next major step came from a pair of rival CPU core IP companies, both offering configurable processor cores. Vendors ARC (San Jose, Calif.) and Tensilica Inc. (Santa Clara, Calif.) served notice of their importance to SoC this year by demonstrating enormous performance increases on a variety of tasks, as measured by the Embedded Microprocessor Benchmark Consortium benchmarks, through specialized custom instructions. These instructions exploited parallelism and eliminated unnecessary instruction fetch or execution cycles, in some cases increasing throughput by a factor of 30 or more.
Another major modification, hardware adaptation for multithreading, is occurring among the more traditional core vendors. With relatively few changes, a CPU core can be adapted to switch between active threads in a very short time. This lets the CPU pick up one thread while it is waiting on a blockage in another thread, such as a pending interrupt, a memory transfer or even a cache miss. If the application is organized as a pool of threads, each with its own guard conditions, then multiple processors in a shared-memory system can dip into the pool at will, executing many threads at once.
Variations on this approach are being explored for algorithms in which instruction-level or data-level parallelism has already been identified.
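The pool-of-threads organization described above, in which each unit of work carries a guard condition and any processor may pick up whatever is ready, can be sketched in software as follows. This is a hedged illustration, not any vendor's mechanism: `run_guarded_pool`, the task names and the shared `state` dictionary are all hypothetical, and Python threads stand in for the multiple processors of a shared-memory system.

```python
import threading


def run_guarded_pool(tasks, state, num_workers=2):
    """Run (name, guard, action) tasks; a task may start once guard(state) is true.

    action(snapshot) receives a copy of the shared state and returns a dict
    of updates, which the worker merges back into the shared state.
    """
    pending = list(tasks)
    cond = threading.Condition()
    running = 0
    completed = []

    def worker():
        nonlocal running
        while True:
            with cond:
                while True:
                    ready = next((t for t in pending if t[1](state)), None)
                    if ready is not None:
                        break
                    if not pending or running == 0:
                        # Pool is empty, or no remaining guard can ever open.
                        cond.notify_all()
                        return
                    cond.wait()  # another worker may open a guard for us
                pending.remove(ready)
                running += 1
                snapshot = dict(state)
            name, _guard, action = ready
            updates = action(snapshot)  # the work itself runs outside the lock
            with cond:
                state.update(updates)
                completed.append(name)
                running -= 1
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed


def demo():
    # Hypothetical three-stage chain, listed out of order on purpose.
    tasks = [
        ("decode", lambda s: "filtered" in s, lambda s: {"bits": s["filtered"] + 1}),
        ("filter", lambda s: "samples" in s, lambda s: {"filtered": s["samples"] * 2}),
        ("receive", lambda s: True, lambda s: {"samples": 8}),
    ]
    state = {}
    return run_guarded_pool(tasks, state, num_workers=3), state


if __name__ == "__main__":
    print(demo())
```

In this sketch the guards, not the listing order, determine when each task runs; workers that find nothing ready simply wait until another worker's result opens a guard.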
At an even higher level, blocks of code or data can be partitioned so that there is little or no interdependence among them and these blocks can be executed in parallel on separate CPUs. With the enormous transistor budgets of 130-nm and 90-nm SoCs, it is feasible to place large CPU clusters on one die, allowing both large local memories and very high bandwidths between CPU caches. This approach, called chip-scale multiprocessing, superficially resembles the multiprocessor systems that have been available for years. But with the high bandwidth of on-chip interconnect, it is proving to have its own dynamics.
The ultimate extension of chip-scale multiprocessing would be to have not a small cluster of CPUs, but a very large number, interconnected on the die by a network that provided bandwidth much greater than bus bandwidth. In such a computing fabric, tasks could be mapped directly onto CPUs almost as easily as drawing a block diagram. At this point, the systems begin to resemble custom data-flow processors. And that is exactly what at least two startup vendors had in mind.
Perhaps the most radical step in this new direction is the one taken by startup QuickSilver Technologies (San Jose, Calif.), which has developed an architecture based on an interconnected array of configurable processors. In fact, the processing elements themselves are configurable arrays of interconnected execution units, adding a nice symmetry to the concept. This notion of multiple levels of similarly configurable execution sites, extendable to the chip level and beyond, prompted QuickSilver to exercise some poetic license and call the concept a fractal architecture.
In any case, the architecture can be adapted to the computing needs of a particular class of applications by controlling the type and number of individual processing elements. The first embodiment of the QuickSilver approach is a chip developed for wireless applications. It employs four types of execution nodes: scalar (actually a commercial RISC core), arithmetic, bit manipulation and finite state machine nodes. Combinations of these nodes are clustered around an interconnect switch matrix, which in turn is connected to local and chipwide routing.
The individual nodes comprise clusters of execution units, which can in turn be interconnected and configured by means of configuration memory. The system was designed so that configurations can be paged in very quickly, allowing, according to the company, the chip to completely change configurations at nearly 60 kHz. The individual computing elements are quite powerful. The company claims that a single arithmetic node can execute a variable-width transcendental function, an FFT or FIR filter algorithm. Each node has an associated embedded memory block that stores code, data and configuration.
Developing a flow
The nodes and their interconnect switches form a fabric onto which data-flow diagrams may be mapped. QuickSilver said this permits designers to move naturally from a block diagram to a data-flow-oriented mathematical modeling environment, then to an augmented version of C and finally to an implementation on the chip. There is no role for hardware description languages in the flow.
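To make the idea of mapping a data-flow diagram onto a heterogeneous fabric concrete, here is a minimal sketch. It does not represent QuickSilver's tools or its augmented C: the function `map_dataflow`, the block names, the node counts and the first-free placement policy are all assumptions chosen for illustration. Only the four node types come from the description above.

```python
def map_dataflow(blocks, edges, fabric):
    """Place each data-flow block on a free node of the type it requires.

    blocks: {block_name: node_type}
    edges:  [(source_block, destination_block)]
    fabric: {node_type: number_of_nodes_available}
    Returns (placement, routes), where routes are node-to-node connections.
    """
    # Name every physical node, e.g. "arithmetic0", "arithmetic1".
    free = {kind: [f"{kind}{i}" for i in range(count)]
            for kind, count in fabric.items()}
    placement = {}
    for block, kind in blocks.items():
        if not free.get(kind):
            raise ValueError(f"no free {kind} node for block {block!r}")
        placement[block] = free[kind].pop(0)  # first-free policy (assumed)
    # Each diagram edge becomes a route through the interconnect.
    routes = [(placement[src], placement[dst]) for src, dst in edges]
    return placement, routes


if __name__ == "__main__":
    # Hypothetical wireless receive chain; block names are illustrative only.
    blocks = {
        "control": "scalar",
        "fir": "arithmetic",
        "fft": "arithmetic",
        "descramble": "bit",
        "framer": "fsm",
    }
    edges = [("fir", "fft"), ("fft", "descramble"), ("descramble", "framer")]
    fabric = {"scalar": 1, "arithmetic": 2, "bit": 1, "fsm": 1}
    placement, routes = map_dataflow(blocks, edges, fabric)
    print(placement)
    print(routes)
```

A real mapper would also have to weigh routing distance, memory capacity and reconfiguration time; the sketch only shows why a block diagram translates so directly into a placement.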
Another startup, picoChip (Bath, England), took a similar approach. The company targeted wireless signal processing, and worked from a point of view similar to QuickSilver's: How do we create a computing structure that can execute algorithms without custom hardware?
The picoChip architects also came up with a cluster of CPUs, said founder and chief technical officer Doug Pulley. Without going into details, Pulley described the architecture as a large array of relatively small processors embedded in a strictly deterministic interconnect structure. Like the QuickSilver design, the picoChip processors are associated with large blocks of embedded memory. And like the QuickSilver design, the array is heterogeneous, with different processors equipped with different specialized instructions to accelerate different tasks. Pulley said the processor array's granularity was chosen to closely match the granularity of the tasks in wireless applications.
The design flow for the picoChip architecture differs from that for the QuickSilver chip, however. Pulley said that the processors, the tools and the flow were all crafted to be HDL-like for the benefit of ASIC designers, rather than C-like for software engineers or algorithm designers. Unlike more software-oriented professionals, Pulley suggested, ASIC designers tend to think in terms of concurrency and to identify the potential parallelism in a situation.
The similarity between the QuickSilver and picoChip architectures is reminiscent of discussions several years ago about intelligent memory architectures. Memory, it was argued, was becoming the dominant component of embedded systems. Future systems would have memory blocks with data flows between them. Transforms that implemented algorithms would be embedded, almost incidentally, in memory arrays. While the industry isn't there yet, the new architectures suggest that it is indeed headed down that path, as architects strive to keep as much of the implementation specifics as possible in software.