Wider DSPs enrich comms design
By Gordon Sterling, DSP Software Engineering Manager, Analog Devices Inc., Norwood, Mass., EE Times
March 12, 2001 (1:44 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010312S0092
A new generation of 32-bit digital signal processors (DSPs) is emerging, providing greater efficiency for 32-bit applications and economy for 16-bit applications that can be streamlined using the richer instruction set of most 32-bit DSPs. The 32-bit DSP core in Analog Devices' ADSP-21160 exemplifies this trend. The device can be used for various applications, including implementation of International Telecommunication Union (ITU) and European Telecommunications Standards Institute (ETSI) recommendations.
The instruction word of a 32-bit DSP is typically much wider than that of a 16-bit DSP. The additional bits provide greater parallelism, allowing several functional units to be active during a single instruction cycle. The ADSP-21160 can execute many operations in parallel, including a multiply-accumulate, a multiply with a separate addition, and multiple memory accesses in a single cycle.
An obvious benefit of a 32-bit DSP over a 16-bit DSP is its ability to easily manipulate 32-bit operands and results. A 16-bit DSP typically has one or two multiply-accumulate registers that allow more than 16 bits of precision, but its other compute units (ALU, shifter) are limited to 16 bits. On a 16-bit DSP, an application, or a portion of one, that requires more than 16-bit precision is forced to use multiple operations. For example, an adaptive filter, such as might be used in a voice-band echo canceller, may require 32-bit coefficients to achieve the desired results. On a 16-bit DSP, updating each of these coefficients would require three to six times as many instructions.
A 32-bit DSP can also operate on 16-bit data, producing the same results as a 16-bit DSP. In this case, the data operands are placed into either the upper or lower bits of the 32-bit registers.
This wastes a portion of the functional units' word size, but it allows true 16-bit operations to be executed. One application of this technique is the implementation of 16-bit recommendations (voice coders or other bit-exact algorithms) of the ITU and ETSI standardization bodies. These standards rely on bit-exact 16- and 32-bit operations to verify conformance to the recommendations.
One such bit-exact operation is L_Mac(), which on the 21160 provides a 16-bit x 16-bit fractional multiply-accumulate into a 32-bit result. Both the multiply and the accumulate need to be saturated. The multiply can overflow only when a 16-bit 0x8000 is multiplied by a 16-bit 0x8000. In 1.15 representation (a common numerical format for these recommendations), this value represents -1. The result of -1 x -1 should be +1, but +1 cannot be represented in fractional format, so the result overflows and, without saturation or reformatting, comes out as -1. Fortunately, many of the recommendations prescale the input data or coefficients to make this case unlikely. The accumulation can easily overflow, however, and must be accounted for in the operation. For this case, most DSPs offer an automatic saturation mode on ALU operations. In this mode the ALU automatically saturates the result of an addition or subtraction to the maximum full-scale value.
To emulate 16-bit fractional operations easily on a 32-bit DSP, the operands can be stored in the upper bits of the 32-bit registers. This provides proper results for fractional multiplies, which produce 32-bit products (assuming no overflow). These 32-bit products can be accumulated in another 32-bit register, rather than in an extended-precision Mac register. For example, on the ADSP-21160 32-bit DSP, a short sequence of code could be used to represent a pipelined L_Mac() operation. In such a case, the result of the multiply from the last cycle is added to the running sum in this cycle.
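The pipelined, saturating multiply-accumulate described above can be sketched in portable C. This is not ADSP-21160 code; it is a minimal illustration under stated assumptions: operands are 1.15 fractions held in `int16_t`, the running sum is a 1.31 fraction in `int32_t`, and the helper names `l_mult`, `l_add` and `l_mac_loop` are hypothetical, chosen only to echo the L_Mac() operation. On the DSP the multiply and the saturating add would overlap in one cycle; in C the staging merely mirrors that order.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating 32-bit add: clamps to full scale instead of wrapping. */
static int32_t l_add(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* 1.15 x 1.15 -> 1.31 fractional multiply.
 * The only overflow case is (-1) x (-1), i.e. 0x8000 x 0x8000,
 * which is saturated to the largest positive value. */
static int32_t l_mult(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN) return INT32_MAX;
    return 2 * ((int32_t)a * (int32_t)b);
}

/* Pipelined sum of products: each pass adds the product from the
 * previous pass while forming the next one. One step fills the
 * pipeline (first multiply) and one drains it (last add). */
static int32_t l_mac_loop(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    if (n == 0) return acc;

    int32_t prod = l_mult(x[0], h[0]);      /* fill */
    for (size_t i = 1; i < n; i++) {
        int32_t next = l_mult(x[i], h[i]);  /* multiply for this cycle */
        acc = l_add(acc, prod);             /* add last cycle's product */
        prod = next;
    }
    return l_add(acc, prod);                /* drain */
}
```

With four input pairs of 0.5 x 0.5 the exact sum would be +1, which is not representable, so the sketch returns the positive full-scale value rather than wrapping to -1.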
The ALU is set for automatic saturation, which avoids the need to explicitly test for overflow of the accumulation. That allows the L_Mac() operation to occur in one cycle for each input value. In addition to the N cycles required for N L_Mac() operations, one cycle to fill the pipeline (the first multiply) and one cycle to drain it (the last addition) must be added. Each L_Mac() loop would therefore execute in about N+2 cycles.
The 16-bit ALU functions, such as add, can be performed just as they are on a 16-bit DSP. In this case the input values are shifted into the 16 most significant bits (MSBs) of the register. The saturation mode of the DSP will correctly saturate these values if an overflow occurs.
A powerful feature of some 32-bit DSPs is the addition of a second functional unit. This second unit can sometimes be used in parallel with the primary unit to double the effective Mips of the processor. These processors usually execute in a single-instruction, multiple-data (SIMD) mode to take advantage of this feature. In this mode both units execute the same instruction but act on different data and coefficient inputs. The coefficients and data are placed in memory such that each unit fetches a new pair each cycle. The units accumulate two separate running sums, which must be added together at the end to produce the final result. Using SIMD operations, these DSPs can achieve twice the effective throughput.
While such a feature can easily be used in a custom application, it can be more difficult to fit into the bit-exact model that standards bodies use, because saturating two partial sums separately does not always give the same result as saturating a single running sum. In some cases, however, the input to various loops has been prescaled to make overflow very unlikely. It is then possible to break loops into two sections and use the parallel functional units to accomplish the L_Mac() in half the time of a single functional unit. In such a case, the operation is identical to that described above, except for two things.
First, the mode of the processor is changed to enable SIMD operation. Second, an additional sum is done at the end of the loop to add the partial sums from the two functional units. A loop of this form will execute in about (N+2)/2+1, or N/2+2, cycles.
If it is not possible to verify that the Mac will not overflow during its loop, it may still be possible to take advantage of the SIMD operation of the processor. In this case, the primary functional unit computes the current invocation of the function, while the secondary unit precomputes the next invocation. Each loop executes in about N cycles, but two invocations are computed simultaneously.
Bit operations are another area worthy of exploration. Packing, unpacking, scanning for bit patterns and other bit-oriented operations benefit from large-word DSPs. Typically, a routine such as the zero-bit insertion and deletion required for the High-level Data Link Control (HDLC) protocol operates on a single word at a time. If words are supplied in 32-bit chunks instead of 16-bit chunks, the DSP can process more input bits per iteration, reducing overhead.
For example, the HDLC recommendations use a sequence of six or more one-bits in a row to indicate a flag or abort condition. This means that a zero-bit must be inserted after every run of five consecutive one-bits in the data, so that the transmitted data never contains six one-bits in a row. This property is referred to as flag transparency. To accomplish zero-bit insertion, a DSP needs to count successive one-bits in the data stream and insert an extra zero each time the count reaches five. To save memory and data I/O, the bits are likely to be packed into the full word size of the DSP.
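The zero-bit insertion just described can be sketched in C over bits packed into 32-bit words. This is an illustrative bit-at-a-time version, not the ADSP-21160 routine the article has in mind. The function name `hdlc_stuff`, the LSB-first bit order within each word, and the requirement that the caller supply a zeroed output buffer with room for the stuffed bits are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies nbits bits from in[] to out[], inserting a zero-bit after
 * every run of five consecutive one-bits. Bits are taken LSB-first
 * from each 32-bit word and packed LSB-first into out[].
 * Assumption: out[] is pre-zeroed and holds at least
 * nbits + nbits / 5 bits. Returns the number of output bits. */
static size_t hdlc_stuff(const uint32_t *in, size_t nbits, uint32_t *out)
{
    size_t opos = 0;     /* next free bit position in out[] */
    unsigned ones = 0;   /* length of the current run of one-bits */

    for (size_t i = 0; i < nbits; i++) {
        unsigned bit = (in[i / 32] >> (i % 32)) & 1u;

        if (bit)
            out[opos / 32] |= (uint32_t)1 << (opos % 32);
        opos++;

        ones = bit ? ones + 1 : 0;
        if (ones == 5) {
            opos++;      /* stuffed zero: buffer is already zeroed */
            ones = 0;
        }
    }
    return opos;
}
```

For instance, eight one-bits in a row (0xFF) become nine output bits, 0x1DF, with the inserted zero at bit position 5. The division and modulo on every bit are the kind of per-bit bookkeeping that a word-oriented inner loop with bit-field instructions would avoid.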
On a typical 16-bit DSP, the inner loop that counts the successive one-bits might normally require seven or so cycles, plus an additional four cycles when a zero-bit is added to the data stream. In addition to this inner loop, an outer loop would need to run for each input word, and logic would be required to handle the bits remaining in the queue at the end of the data stream.
In addition to the per-word savings, a 32-bit DSP's richer instruction set provides savings. The inner loop could be reduced to eight cycles per bit for the entire operation.
The bit-field extract and deposit functions of the ADSP-21160 can also be used to efficiently implement bit FIFOs for storing input or output bit streams. This approach uses the shifter overflow status flag to detect when a running bit field exceeds a 32-bit word and overflows into the next word of the FIFO. Conditional memory accesses make it easier to keep track of the next available word in memory.
A 32-bit DSP could update the full coefficient much faster, because the entire coefficient can be updated in only one or two instructions. Applications that require 32-bit precision will execute more efficiently on a 32-bit DSP.
On a 32-bit DSP the overhead for each word would be reduced, because the same amount of data would be packed into half the number of words used for a 16-bit DSP. If the per-word overhead for a 16-bit DSP is around 6 cycles, the total per-word overhead for an N-bit data stream is Int(N/16) x 6; on a 32-bit DSP it would be halved.
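As a worked instance of this estimate, take a hypothetical data stream of N = 1024 bits and assume that the per-word overhead stays at about 6 cycles on both word sizes (the text gives the 6-cycle figure only for the 16-bit case):

```latex
\begin{align*}
\text{16-bit DSP:}\quad & \left\lfloor \tfrac{1024}{16} \right\rfloor \times 6 = 64 \times 6 = 384 \text{ cycles} \\
\text{32-bit DSP:}\quad & \left\lfloor \tfrac{1024}{32} \right\rfloor \times 6 = 32 \times 6 = 192 \text{ cycles}
\end{align*}
```

Under that assumption the word-level overhead is halved, independent of any savings in the per-bit inner loop.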