Processor Architecture for High Performance Video Decode
High definition television (HDTV) broadcasts over cable and from terrestrial and satellite stations are appearing in viewing markets around the world. Due to consumer demand and government mandates, HDTV-capable digital consumer video equipment will be sold in high volumes in the next few years. Chipmakers are designing the chips to drive these devices today.
HDTV is displayed at a resolution of 1920 x 1080 pixels. That is six times the number of pixels of a standard 720 x 480 NTSC video display. To store that many more pixels on a standard DVD sized disk or broadcast that many more pixels with the same transmission bandwidth as standard definition television requires improved algorithms for video compression. Similarly, to transmit video in the limited bandwidth available to portable wireless devices requires improved algorithms for video compression.
Next generation video coding standards have been developed to deliver improved video compression. For a given display resolution, the computational requirement of a next generation codec may be as high as ten times that of MPEG2. This is compounded by the higher resolutions required by future video displays: six times the pixels at up to ten times the work per pixel means that a video processor for HDTV requires about 60 times the processing power of a video processor for NTSC television. That is nearly nine years' worth of performance scaling according to Moore's law.
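The scaling figures above can be checked with a few lines of arithmetic. The sketch below assumes a performance doubling every 18 months as the reading of Moore's law; the article does not state the doubling period, so that constant is an assumption, as are all of the variable names.

```python
import math

# Display resolutions quoted in the article.
hdtv_pixels = 1920 * 1080
ntsc_pixels = 720 * 480

# Ratio of pixels per frame, HDTV versus standard definition.
pixel_ratio = hdtv_pixels / ntsc_pixels

# Upper-bound codec cost relative to MPEG2 at the same resolution (from the article).
codec_factor = 10

# Combined processing requirement relative to an NTSC MPEG2 decoder.
total_factor = pixel_ratio * codec_factor

# ASSUMPTION: performance doubles every 1.5 years (one common reading of Moore's law).
years_per_doubling = 1.5
years_of_scaling = math.log2(total_factor) * years_per_doubling

print(f"pixel ratio:      {pixel_ratio:.1f}x")
print(f"total factor:     {total_factor:.0f}x")
print(f"years of scaling: {years_of_scaling:.1f}")
```

With a 24-month doubling period instead, the same 60x gap would correspond to closer to twelve years, so the conclusion that scaling alone will not close the gap soon holds either way.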
MPEG2 has been a nearly universal video compression standard for many years. Several video compression standards that offer better quality at a given bit rate hope to displace MPEG2 in next generation video applications. H.264 succeeds developments in H.263 and MPEG4 by the ITU-T and ISO open standards bodies. By lineage, H.264 is the grandchild of MPEG2. Microsoft's Windows Media Video 9 (WMV9) standard is making a play, with lower royalties, for many of the same application spaces as H.264. Promoted by such an influential company, WMV9 cannot be ignored. On2 Technologies' next generation VP6 video compression standard has also seen adoption in royalty-sensitive markets. No next generation video standard promises the near universal acceptance that MPEG2 has enjoyed. We should expect a future in which consumer video devices must support multiple video compression standards.
MPEG2 video decode, within chips for consumer video applications, has been most often implemented in fixed-function hardware. MPEG2 decode can just as easily be implemented in software and run on a software programmable processor. The reason for using fixed-function hardware is that it can achieve the required performance with a smaller die area than a software programmable processor. A software programmable processor makes sense when the application requires more variations than can be supported in a fixed-function processor. For example, consumer video devices that require support for multiple video compression standards achieve a smaller die area by using a single software programmable processor for video decode than by using multiple pieces of fixed-function hardware. Different video compression standards are supported in such a processor by different software functions.
Because of the performance-demanding complexities of next generation video compression standards, neither today's blazingly fast Pentiums nor today's video-optimized digital signal processors are capable of decoding bitstreams of next generation coding standards at the resolution and frame rate of HDTV. Designing a chip capable of overcoming the gap between today's processors and the demands of HDTV video compression standards is a necessity for manufacturers of next generation digital consumer video equipment. Ultra Data, for example, is currently developing the UD3000, an asymmetric, heterogeneous software programmable processor architecture capable of decoding bitstreams of next generation video standards for HDTV. The UD3000 achieves sufficiently high performance through a multi-core processor.
Video compression is achieved by exploiting spatial and temporal redundancy of data in the video frame sequence. As a result, data dependencies exist in the reconstruction of the decoded video frame sequence that prevent significant performance gains by increasing processor count in a symmetric manner. Dividing the video frame spatially among symmetric processors does not work because intra prediction (prediction from neighboring pixels) and deblocking filtering along the edges of the sub-frame processed by each processor would depend on the results of other processors. Interleaving the frame sequence temporally among symmetric processors does not work because inter prediction (prediction from previously coded frames) and adaptive parameters would depend on other processors' results.
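The alternative is to divide the work by decode step rather than by data: each processor handles a different step and passes its results to the next, forming a pipelined data flow across asymmetric, heterogeneous processors. A fixed-function hardware video decoder also implements a dataflow pipeline with heterogeneous processing elements. The only difference between a fixed-function hardware video decoder and an asymmetric heterogeneous multiprocessor with a pipelined data flow is that the multiprocessor can be programmed with different software functions to implement different video compression standards.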
Each step of the decode pipeline, for most video applications, operates on structures containing 8-bit data elements. Intermediate calculations within each step require 16-bit data resolution. The internal datapaths of today's general-purpose DSPs are optimized for audio, which operates on 16-, 24-, or 32-bit data. Such processors are not optimally efficient for video processing applications. Furthermore, video processing operates on two-dimensional data. Video processor instruction sets are optimized for two-dimensional filters and transforms using matrix multiplication, unlike general-purpose DSPs, which perform linear filters and transforms using simple multiplication operations. Video processors also need to access elements of two-dimensional arrays in both horizontal and vertical sequences, which general-purpose DSPs do not support efficiently.
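A small worked example shows both points: a two-dimensional transform expressed as matrix multiplication, and 8-bit-range inputs producing intermediates that need 16 bits. The sketch below uses the 4 x 4 integer core transform matrix from H.264 purely as an illustration; the article does not single out this transform, and the function names are hypothetical. In plain Python the integers are unbounded, so the 16-bit claim is checked by comparing against the signed 16-bit limit.

```python
# 4x4 integer core transform matrix used by H.264 (illustrative choice).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    """Swap rows and columns, i.e. switch between horizontal and vertical access."""
    return [list(row) for row in zip(*m)]

def transform_2d(block):
    """Two-dimensional transform Y = CF * X * CF^T: one pass per dimension."""
    return matmul(matmul(CF, block), transpose(CF))

# A flat block of maximum-value 8-bit pixels: all energy lands in the DC term.
flat = [[255] * 4 for _ in range(4)]
flat_out = transform_2d(flat)

# Worst-case residual block (values in -255..255) aligned with row 1 of CF.
signs = [1, 1, -1, -1]
worst = [[255 * signs[r] * signs[c] for c in range(4)] for r in range(4)]
worst_out = transform_2d(worst)

peak = max(abs(v) for row in worst_out for v in row)
print("DC of flat block:", flat_out[0][0])
print("peak magnitude:", peak, "fits in int16:", peak <= 32767)
```

The peak magnitude exceeds what 8 bits can hold by a wide margin yet stays well inside a signed 16-bit range, which is why a datapath built around 8-bit storage and 16-bit arithmetic suits this kind of workload.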
Video decode applications have hard real-time requirements. General-purpose processors that use caches and take interrupts struggle to meet them, because the time required to service cache misses and interrupts is non-deterministic. In an asymmetric heterogeneous multiprocessor with a pipelined data flow for video, it is possible to issue each off-chip reference data read early enough that the data is available before it is needed. Since latency variation is masked by the early read, the only remaining concern is sufficient off-chip memory bandwidth. An efficient multiprocessor architecture for video will include enough storage to hold all intermediate data structures locally in on-chip RAM to minimize off-chip bandwidth consumption.
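The effect of issuing reads early can be illustrated with a toy timing model. The sketch below is not a model of any particular chip: the latencies, the fixed per-block processing time, and the function name `simulate_stalls` are all assumptions made for illustration. A prefetch depth of 0 means the read is issued only when the data is needed; a depth of d means the read for block i is issued when block i - d starts processing.

```python
import math

def simulate_stalls(latencies, stage_time, depth):
    """Return total cycles the processor spends waiting for read data.

    latencies  -- off-chip read latency for each block, in cycles (hypothetical)
    stage_time -- cycles of processing per block (assumed constant)
    depth      -- how many blocks ahead each read is issued (0 = on demand)
    """
    starts = []   # cycle at which each block began processing
    ready = 0     # cycle at which the processor is free for the next block
    stalls = 0
    for i, latency in enumerate(latencies):
        if depth == 0:
            issue = ready                 # on-demand read
        elif i >= depth:
            issue = starts[i - depth]     # read issued 'depth' blocks early
        else:
            issue = 0                     # pipeline fill: issue at start-up
        arrival = issue + latency
        start = max(ready, arrival)       # wait if the data is not there yet
        stalls += start - ready
        starts.append(start)
        ready = start + stage_time
    return stalls

latencies = [40, 90, 20, 100, 60, 100, 30, 80]   # hypothetical, variable latency
stage_time = 50

# Depth sufficient to cover the worst-case latency.
depth = math.ceil(max(latencies) / stage_time)

print("on-demand stalls: ", simulate_stalls(latencies, stage_time, 0))
print("prefetched stalls:", simulate_stalls(latencies, stage_time, depth))
```

In this model, once the prefetch depth covers the worst-case latency, the only wait left is the initial pipeline fill, and the variation in latency no longer affects throughput. The cost is on-chip buffering for the blocks in flight, which is the storage the paragraph above calls for.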
The off-chip memory bandwidth required to decode the theoretical worst-case H.264 video bitstream exceeds the bandwidth available in a 64-bit wide 200 MHz DDR SDRAM. A large majority of that bandwidth is consumed by cycles spent waiting for DRAM row activation during the row address strobe (RAS) phase of an access. Much of this delay is unavoidable because the source locations of inter prediction reads cannot be predicted. Optimal organization of data within memory can minimize but not eliminate the cycles wasted on RAS delays. If video frame buffers are organized so that square groups of pixels occupy consecutive DRAM addresses, and each group is the size of one DRAM row, then most block reads will fall within a single DRAM row, and in the worst case a block read will span groups stored in four different DRAM rows. This means that a block read will require at most four RAS delays.
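The four-row bound can be checked by counting the DRAM rows a block read touches under each layout. The sketch below assumes a 1024-byte DRAM row, one byte per pixel, a 32 x 32 pixel tile that exactly fills one row, and a 1920-byte raster stride; none of these values come from the article, and both function names are hypothetical.

```python
def tiled_rows(x, y, w, h, tile=32):
    """DRAM rows touched by a w x h block when each tile x tile group of
    pixels fills exactly one DRAM row (assumed: 32 * 32 * 1 byte = 1024 bytes)."""
    tiles_across = (x + w - 1) // tile - x // tile + 1
    tiles_down = (y + h - 1) // tile - y // tile + 1
    return tiles_across * tiles_down

def raster_rows(x, y, w, h, stride=1920, row_bytes=1024):
    """DRAM rows touched by the same block in a plain raster-order frame buffer."""
    rows = set()
    for line in range(y, y + h):
        first = (line * stride + x) // row_bytes
        last = (line * stride + x + w - 1) // row_bytes
        rows.update(range(first, last + 1))
    return len(rows)

# Worst case over every alignment of a 16 x 16 block against the tile grid.
worst_tiled = max(tiled_rows(x, y, 16, 16) for x in range(64) for y in range(64))

print("tiled, aligned block:    ", tiled_rows(0, 0, 16, 16))
print("tiled, worst alignment:  ", worst_tiled)
print("raster, 16 x 16 at (0,0):", raster_rows(0, 0, 16, 16))
```

Under these assumptions the raster layout puts nearly every line of the block in a different DRAM row, while the tiled layout never exceeds four rows for any block no larger than a tile.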
Each of the step algorithms executed to transform a bitstream into a video display requires a different combination of decision-making and mathematical calculation. Traditional high-speed general-purpose RISC processors are efficient at the branching operations required for decision-making heavy algorithms. Traditional general-purpose DSPs, with their single instruction multiple data (SIMD) parallelism, are efficient at intense mathematical calculations. A VLIW architecture with parallel RISC and DSP execution units allows the most precise fine-grained matching of high performance mathematical operations and decision making.
When studying video processors consider the value of software programmability. Consider optimizing the mix of RISC and DSP processors for the decode and display steps required. Also consider the dataflow within the video processor and to off-chip storage. Finally, make sure that your architecture can achieve the required real-time performance for a worst-case bitstream with worst-case memory access latencies.