Customized DSP -> Applications take the driver's seat
By Ray Simar, Fellow, Texas Instruments Inc., Stafford, Texas, EE Times
March 12, 2001 (1:36 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010312S0087
The days when digital signal processors were designed first, then fitted to the application, are long gone. The universe of DSP applications has grown so large that the "one-DSP-fits-all-needs" philosophy no longer applies. Today, the key to successful DSP design lies in working from the application in, rather than from the core out. DSP core design, like DSP chip-level design, cannot be separated from the needs of the system. Even fundamental architectural decisions are increasingly based on end-equipment requirements, so that DSP core design has become more application oriented than ever.
Like all integrated circuits, DSPs are subject to three overriding optimization vectors: cost, performance and power consumption. Under ideal circumstances, all the processing ever needed would be available using negligible power and at no cost. Since this ideal cannot be realized, DSP chip designers are forced to make trade-offs among the three vectors, based on the manufacturing-process technology available to them at a given time. Just as important, DSP system designers are forced to choose among the trade-offs presented by chip designers. This is where application requirements come in.
In some areas, performance is the overriding consideration. These applications include third-generation (3G) wireless basestations, streaming-video servers and gateways, digital subscriber line access multiplexers (DSLAMs) and central office (CO) cards, and other types of multichannel concentration units that handle a high data throughput. Power comes second in these systems, since less heat dissipation allows more channels to be packed into the same space, reducing operating costs per channel and reducing the energy needed to run the Internet. For mass-market applications, such as asymmetric-DSL modems and residential gateways, low cost is paramount.
In the case of digital wireless handsets, low cost is the first requirement, followed by low power consumption as a close second, since it is vital for a handset to feature long talk and standby times between battery charges. Performance is important in these products, of course, but since only a certain level of performance is needed to fulfill the application, the factors that drive design are cost and power. Motor control illustrates this rule starkly, since performance overhead beyond the fundamental algorithm is wasted, unless the application offers value-added features. In high-volume applications like these, cost rules.
But measuring any one of the three vectors is tricky, since requirements vary with the individual application. Raw performance expressed in Mips or millions of multiply-accumulate operations per second (MMACS) is only a rule of thumb, since the algorithms used for each type of application are significantly different. What the Mips or MMACS accomplish can only be determined by in-depth code analysis and careful testing in target systems. Similarly, system cost is generally more important than the cost of the individual device, so the DSP may integrate memory and peripherals that actually increase the size and cost of the die, even though they result in a savings for the board.
Operating power, too, must be looked at in terms of usage. For instance, passive power dissipation normally has more effect on the time between battery charges in 3G wireless handsets than active operation has. So the microwatts consumed during standby modes are at least as important as the milliwatts during active modes. DSPs can also offload analog functions such as filtering from the system, saving external power and components, but increasing the device power consumption by processing the functions digitally. Like performance and cost, power consumption has to be calculated carefully in terms of the application.
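The standby-versus-active point can be made concrete with a rough daily energy budget. The sketch below is only an illustration: the function name and every figure in it (50 mW active, 500 microwatts standby, 15 minutes of active use per day) are hypothetical values chosen for the arithmetic, not numbers from any device data sheet.

```python
def daily_energy_mwh(active_mw, standby_uw, active_hours, day_hours=24.0):
    """Return (active, standby) energy in milliwatt-hours over one day.

    All inputs are hypothetical illustration values, not device data.
    """
    standby_hours = day_hours - active_hours
    active_energy = active_mw * active_hours                  # mW * h
    standby_energy = (standby_uw / 1000.0) * standby_hours    # uW -> mW, then * h
    return active_energy, standby_energy


# Hypothetical usage profile: 15 minutes active per day, idle otherwise.
active, standby = daily_energy_mwh(active_mw=50.0, standby_uw=500.0,
                                   active_hours=0.25)
print(f"active:  {active:.3f} mWh")
print(f"standby: {standby:.3f} mWh")
```

With these assumed figures the two contributions are nearly equal, even though the standby power is a hundred times lower than the active power, because the device spends almost the whole day in standby. A different usage profile would shift the balance, which is why the budget has to be worked out per application.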
All of those factors have an impact on DSP design, from core to chip to software. No resource is overlooked when it comes to achieving the right balance for a given application among the three vectors of performance, cost and power consumption. Among the resources at the disposal of DSP core designers are the process technology, core and supporting chip architectures, instruction sets, and assemblers, compilers and other development software. Core design is the magnet that aligns all of those elements in meeting the requirements of a target market segment.
Underlying all design factors is the manufacturing process. The transistor scale and metallization pitch at a given CMOS process node affect die size and thus cost. At the same time, smaller transistors can operate at reduced voltages and switch faster, reducing power and increasing performance. A snapshot of the industry today shows advanced DSP cores being manufactured with 0.15-micron transistor gate widths and core operating voltages of 1.5 volts. Soon, DSPs will appear with gate widths nearing 100 nanometers (or 0.1 micron) and core voltages below 1 V. The trend of scaling down transistor geometries and voltages will continue.
Less widely known than these general observations is the fact that transistors in a given process node can be "tuned" to the desired application. Transistors with low transition thresholds are designed to stay on the verge of changing states, so that they gain speed through faster switching but dissipate slightly more power through leakage current. On the other hand, transistors with high transition thresholds are designed to stay more tightly clamped, conserving power but introducing a slight delay. Of course, these speed and power differentials are negligible for individual transistors, but considered in aggregate for millions of transistors, they make a considerable difference in the behavior of a device.
DSP suppliers with in-house manufacturing can manipulate these transistor variables to make a core better suited for a given application. Licensed cores, on the other hand, are designed for use with various processes, so they cannot take aggressive advantage of transistor tuning.
Naturally, a DSP's core architecture is central to its design and provides some of the most significant leverage for making trade-offs among the three optimization vectors. In recent years, very long instruction word (VLIW) architectures have become the norm for high performance. To decrease power consumption, designers have introduced new features into traditional architectural approaches. Cost is largely determined by the integration of memory and peripherals and by the process, since smaller-geometry processes pack more transistors into smaller, less expensive die areas.
VLIW architectures push performance by enabling parallel execution on multiple data sets. TI's TMS320C64x DSP core, for example, has eight 32-bit functional units, all of which can perform operations on different data sets simultaneously. The possible combinations of operations include up to four 16-bit MACs or eight 8-bit MACs per cycle, yielding up to 4,800 MMACS at 600 MHz. In telecommunications and video/imaging applications, where the aggregate data sets are enormous, this level of performance is indispensable for supporting the massive throughput needed.
Although speed of execution is the overriding benefit of the VLIW architecture, supporting elements in a design can make costs and power consumption more acceptable. A direct-mapped memory capable of feeding the VLIW engine could be enormous and expensive to integrate. On the other hand, a two-level cache memory keeps die size down while still providing the throughput needed by the core. While it makes a small memory appear to the core like a large one, the cache also serves to reduce off-chip memory access and the power consumption needed for external access.
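The peak figure quoted above for the C64x is simply the number of MACs issued per cycle multiplied by the clock rate. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 16-bit result is derived here from the article's own figures rather than quoted from it.

```python
def peak_mmacs(macs_per_cycle, clock_mhz):
    """Peak multiply-accumulate rate in millions of MACs per second.

    This is an upper bound that assumes every cycle issues the full
    number of MACs; sustained rates depend on the algorithm and code.
    """
    return macs_per_cycle * clock_mhz


CLOCK_MHZ = 600  # clock rate cited in the article

print("8-bit peak: ", peak_mmacs(8, CLOCK_MHZ), "MMACS")   # eight 8-bit MACs per cycle
print("16-bit peak:", peak_mmacs(4, CLOCK_MHZ), "MMACS")   # four 16-bit MACs per cycle
```

The number is a ceiling, not a prediction: as noted earlier, what the MMACS actually accomplish can only be established by analyzing and testing real application code.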
Acceleration hardware can complement the programmable DSP core by performing the bulk of the work for specific algorithms that are heavily used in a target application. By off-loading predictable, heavily used routines, hardware acceleration extends the utility of the programmable core and boosts the overall performance of the device. New devices in the C64x family, for instance, add Viterbi and turbo acceleration for algorithms frequently used in 3G wireless communications. Viterbi acceleration is also included in DSPs aimed at wireless handsets, while other types of acceleration have been used to complement TI DSPs for high-speed networking, digital still cameras and other applications.
Because the eight parallel data paths in a VLIW engine create complex instruction pipelining, a good compiler is essential, not only to keep the engine fed with data, but also to cut down on the programming complexity. Compilers and hardware structures such as cache management ease the burden of software design, allowing programmers to create DSP applications without an intimate knowledge of the underlying mechanics of the device. Top VLIW compilers achieve roughly 80 percent of the efficiency of hand-coded assembly, depending on the application. As a result, developers can focus their resources on programming the code that can be of greatest benefit to the system. This concentration of resources is aided by new profile-based compilers, which help designers quickly assess trade-offs between code size and cycle counts. Profile-based compiling is extremely important when it comes to balancing available memory against performance, as the application requires.
Design techniques that reduce power consumption in both the core and the chip include static registers, latching buses to keep them from floating, and disabling clocks to logic functions and peripherals that are not in use.
On-chip memory control is distributed and serves to funnel each access down to a 2-kword block, so that only enough power is consumed to enable that block for a given access. Core internal operations are designed to reduce both data reads and instruction fetches. With appropriate coding, a dual-MAC core allows the same coefficient to be read and loaded into both MAC units simultaneously for multiplication by two different array values. Dual-MAC operations can thus take place with only three data reads per cycle instead of four, resulting in a considerable power reduction for many routines. Instruction fetches are minimized by using a scalable instruction word and a 32-bit instruction bus that can spool multiple instructions in a single fetch cycle.
Although DSP design has never existed completely in isolation from application needs, today more than ever DSPs are being tailored to the requirements of specific end uses. Because DSP suppliers are forced to make trade-offs at all levels to optimize performance, cost and power consumption, every DSP must be tailored to meet the needs of a market area. The core is at the center of this decision-making process, and the philosophy behind core development drives every other design decision from transistor tuning to software platforms. As the universe of DSP applications grows, DSP development will become even more application oriented.
For applications where low cost and power are even more important than raw speed, traditional DSP architectures are being significantly reshaped. TI's TMS320C55x DSP core, for example, was designed specifically for the needs of 3G wireless handsets, where low cost and power consumption go hand in hand with a higher level of performance than that needed in traditional voice-only handsets. The core uses a dual-MAC data path for greater parallelism.
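The shared-coefficient scheme described earlier, in which one coefficient feeds both MAC units, can be sketched in software. The following is a plain Python model, not TI code and not a description of how any particular device is programmed: the function names and the read counter are hypothetical, and the FIR filter is chosen only because it is a common routine where two adjacent outputs use the same coefficient at each step.

```python
def fir_single_mac(x, h):
    """Reference FIR: one MAC per step, two data reads (coefficient + sample)."""
    taps = len(h)
    n_out = len(x) - taps + 1
    y, reads = [], 0
    for n in range(n_out):
        acc = 0
        for k in range(taps):
            acc += h[k] * x[n + taps - 1 - k]
            reads += 2
        y.append(acc)
    return y, reads


def fir_dual_mac(x, h):
    """FIR computing two adjacent outputs per pass with a shared coefficient.

    Each inner step models one dual-MAC cycle: one coefficient read plus
    two sample reads (three reads) instead of four.
    """
    taps = len(h)
    n_out = len(x) - taps + 1
    y = [0] * n_out
    reads = 0
    n = 0
    while n + 1 < n_out:
        acc0 = acc1 = 0
        for k in range(taps):
            coeff = h[k]                 # read once, used by both MACs
            s0 = x[n + taps - 1 - k]     # sample for output n
            s1 = x[n + taps - k]         # sample for output n + 1
            reads += 3
            acc0 += coeff * s0
            acc1 += coeff * s1
        y[n], y[n + 1] = acc0, acc1
        n += 2
    if n < n_out:                        # odd output count: finish with one MAC
        acc = 0
        for k in range(taps):
            acc += h[k] * x[n + taps - 1 - k]
            reads += 2
        y[n] = acc
    return y, reads


samples = [1, 2, 3, 4, 5]
coeffs = [2, 1]
print(fir_single_mac(samples, coeffs))
print(fir_dual_mac(samples, coeffs))
```

Both functions produce the same outputs; only the modeled read count differs, by the three-to-four ratio the article describes. In hardware the saving shows up as fewer memory accesses and bus transitions per result, which is where the power reduction comes from.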