Revving video encoding on C64x/DM64x DSPs
EE Times: Revving video encoding on C64x/DM64x DSPs | |
Cheng Peng (06/20/2005 9:00 AM EDT) URL: http://www.eetimes.com/showArticle.jhtml?articleID=164900190 | |
How does one improve video encoding performance on the Texas Instruments TMS320C64x/TMS320DM64x digital signal processor generation? The conventional implementation of a video encoder (MPEG-2, MPEG-4, H.263) is based on macroblock-level processing. The video encoder fetches a new macroblock (MB) only after the current MB goes through all the processing steps. But this intuitive approach comes with two drawbacks: - The overall code size of a video encoder is usually bigger than the Level 1 program cache (L1P) on a C64x/DM64x DSP. The code needs to be swapped between L1P and the Level 2 program cache (L2P) during every MB fetching period, causing a significant cache-miss penalty. - It is not efficient for the enhanced DMA (EDMA) controller to transfer a small chunk of data such as a single MB from an external video frame memory to internal memory. To avoid the huge cache-miss penalty and CPU stalling, the algorithm can be broken into three loops, each of them a separate module that fits into L1P. Instead of processing a single MB at a time in each loop, the module processes n macroblocks-an MB strip. The size of a strip is restricted only by the size of the available Level 1 data cache (L1D). The bigger n is, the better EDMA performance we can expect for data throughput. The three loops are: - the MB encoding loop, - the motion estimation loop, and - the MB reconstruction loop. As emphasized above, n MBs are fetched and go through one of the three processing loops together. For example, in the MB encoding loop, when n MBs are fetched into internal memory, they are put through a discrete cosine transform (DCT), quantized and entropy-coded. This set of macroblocks is not flushed out of L1D until the MB encoding loop has been completed. Corresponding programs, including DCT, quantization and variable-length coding kernels, are also kept in L1P until all n MBs are processed completely in this loop. A ping-pong memory buffering scheme driven by the EDMA helps reduce the initial setup time needed to perform these loops for a strip of MBs. Ping-pong buffering also ensures minimal CPU stalling cycles because the transfers are overlapped with processing. Cheng Peng (c-peng2@ti.com), DSP video application engineer for Texas Instruments Inc. (Dallas)
| |
- - | |
Related Semiconductor IP
- AES GCM IP Core
- High Speed Ethernet Quad 10G to 100G PCS
- High Speed Ethernet Gen-2 Quad 100G PCS IP
- High Speed Ethernet 4/2/1-Lane 100G PCS
- High Speed Ethernet 2/4/8-Lane 200G/400G PCS
Related White Papers
- Emerging H.264 standard supports broadcast video encoding
- Meeting the Challenge of Real-Time Video Encoding: Migrating From H.263 to H.264
- Achieving Optimized DSP Encoding for Video Applications
- Efficient radix-4 FFT on StarCore SC3000 DSPs
Latest White Papers
- New Realities Demand a New Approach to System Verification and Validation
- How silicon and circuit optimizations help FPGAs offer lower size, power and cost in video bridging applications
- Sustainable Hardware Specialization
- PCIe IP With Enhanced Security For The Automotive Market
- Top 5 Reasons why CPU is the Best Processor for AI Inference