Running Optimized PyTorch Models on Cadence DSPs with ExecuTorch
Introduction
Deploying PyTorch models on embedded devices, especially audio DSPs, presents unique challenges. To address these, Cadence and Meta have collaborated to create a robust, high-performance framework for deploying machine learning models on Cadence's Tensilica HiFi DSP family. By leveraging ExecuTorch and applying both graph-level and operator-level optimizations, the teams have achieved speedups of at least an order of magnitude compared to standard out-of-the-box deployments.
ExecuTorch
ExecuTorch is a solution for training and inference on the edge, designed for portability, productivity, and performance. It supports a wide variety of platforms, from mobile phones to embedded systems and microcontrollers, and enables developers to use familiar PyTorch toolchains for model authoring, conversion, debugging, and deployment. ExecuTorch provides a lightweight runtime and leverages full hardware capabilities, including CPUs, GPUs, NPUs, and DSPs.
Tensilica HiFi DSP Family
The Cadence Tensilica HiFi DSP family for audio, voice, speech, and AI offers low-energy, high-performance DSP solutions that span the entire spectrum of audio and voice algorithms and end equipment. Audio/voice/speech (AVS) processing covers a wide range of performance and power-consumption requirements. At one end of the spectrum is the ultra-low-power "wake-on-voice" processing used in many of today's smartphones and wearables. At the other end, state-of-the-art voice-controlled digital assistants require advanced audio signal processing to run neural-network-based speech recognition efficiently. The family includes multiple products, ranging from the HiFi 1s DSP at the low end to the highest-performing HiFi 5s DSP.
Performance Highlights
Cadence and Meta have collaborated to improve the performance of various neural network (NN) operators on the Tensilica HiFi 4 DSP using the HiFi NN library. Measured on seven open-source models from the ExecuTorch repository, the optimized operators deliver speedups of roughly 13x to 38x over standard out-of-the-box deployments:
| Model | Output Size | Base FPS | Optimized FPS |
|---|---|---|---|
| RNNT Predictor | [1, 10, 256] | 146.5 | 2875.6 |
| RNNT Encoder | [1, 25, 256] | 5.9 | 82 |
| RNNT Joiner | [1, 25, 10, 128] | 9.9 | 261.1 |
| Baby Llama (1 layer) | [1, 512] | 0.5 | 6.5 |
| Resnet-18 | [1, 1000] | 0.2 | 7.7 |
| Resnet-50 | [1, 1000] | 0.1 | 3.6 |
| MobileNetv2 | [1, 1000] | 0.7 | 12.4 |
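To make the size of the improvement concrete, the per-model speedup can be derived from the table as optimized FPS divided by base FPS. The following is a small sketch of that arithmetic only; the `RESULTS` dictionary and the `speedups` helper are illustrative names, not part of ExecuTorch or the HiFi NN library. Because the base FPS values are rounded to one decimal place, the ratios for the slowest models are approximate.

```python
# Base and optimized FPS per model, copied from the table above.
RESULTS = {
    "RNNT Predictor": (146.5, 2875.6),
    "RNNT Encoder": (5.9, 82.0),
    "RNNT Joiner": (9.9, 261.1),
    "Baby Llama (1 layer)": (0.5, 6.5),
    "Resnet-18": (0.2, 7.7),
    "Resnet-50": (0.1, 3.6),
    "MobileNetv2": (0.7, 12.4),
}


def speedups(results):
    """Return {model: optimized_fps / base_fps} for every model."""
    return {name: opt / base for name, (base, opt) in results.items()}


if __name__ == "__main__":
    # Print models from the smallest to the largest speedup.
    for name, ratio in sorted(speedups(RESULTS).items(), key=lambda kv: kv[1]):
        print(f"{name:<22} {ratio:5.1f}x")
```

Every ratio computed this way is above 10, which is consistent with the "at least an order of magnitude" claim in the introduction.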
Operator Coverage and Data Types
ExecuTorch now supports a wide range of operators and data types that are optimized for Tensilica HiFi DSPs:
- Compute Operators: Fully Connected, Matrix Mul, Convolution 1D/2D, Depthwise Convolution, Dilated Convolution
- Non-linear Activations: Sigmoid, Tanh, Softmax, ReLU
- Elementwise Operators: Add, Sub, Mul, Div, Quantize, Dequantize
- Normalization Operators: Mean, Squared-diff, Reciprocal-square-root, Min, Max
- Reorg Operators: Copy, Slice, Transpose, Concatenation
- Activation Data Types: asymmetrically quantized signed int8, asymmetrically quantized unsigned int8, float32
- Weight Data Types: symmetrically quantized int8, float32
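The data types above distinguish asymmetric quantization (used for activations) from symmetric quantization (used for weights). The sketch below illustrates the textbook definitions of the two schemes in plain Python. It is a generic illustration under stated assumptions, not the ExecuTorch quantizer or the HiFi NN library kernels; all function names are hypothetical.

```python
def quantize_asymmetric(values, num_bits=8, signed=True):
    """Affine quantization: q = round(x / scale) + zero_point."""
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range is widened to include 0 so that 0.0 maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - lo / scale))
    q = [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]
    return q, scale, zero_point


def quantize_symmetric(values, num_bits=8):
    """Symmetric quantization: zero_point is fixed at 0, range is [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, int(round(v / scale)))) for v in values]
    return q, scale


def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate real values."""
    return [(v - zero_point) * scale for v in q]


if __name__ == "__main__":
    activations = [0.0, 0.5, 1.0, 6.0]   # e.g. non-negative outputs of a ReLU
    weights = [-0.4, 0.1, 0.25]
    print(quantize_asymmetric(activations, signed=True))
    print(quantize_asymmetric(activations, signed=False))
    print(quantize_symmetric(weights))
```

The asymmetric form spends the full integer range on a one-sided distribution such as post-ReLU activations, while the symmetric form drops the zero point, which removes an offset term from the multiply-accumulate arithmetic for weights.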
Current and Future Work
Opportunities remain to further enhance performance and expand support across different DSPs within the Tensilica HiFi family and beyond. Ongoing and future initiatives include:
- Expanding DSP Support: Enabling additional Tensilica HiFi DSPs, such as the HiFi 1s DSP (suited to always-on, energy-efficient applications and small NN workloads) and the HiFi 5s DSP (NN-ready, offering approximately 4x the performance of the HiFi 4 DSP)
- Quantization Improvements: Introducing 16-bit activation support in the quantizer
- Latency Optimizations: Investigating fused layers (e.g., LSTM, GRU) for further latency reduction
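As background for the 16-bit activation item above, the resolution gained from a wider integer type can be estimated with simple arithmetic. The sketch below assumes a uniform quantizer over a fixed real-valued range; the article does not describe the quantizer's actual 16-bit scheme, and the function names are hypothetical.

```python
def quantization_step(range_min, range_max, num_bits):
    """Width of one step when [range_min, range_max] is split into 2**num_bits levels."""
    levels = 2 ** num_bits
    return (range_max - range_min) / (levels - 1)


def max_rounding_error(range_min, range_max, num_bits):
    """Worst-case error from rounding an in-range value to the nearest level."""
    return quantization_step(range_min, range_max, num_bits) / 2


if __name__ == "__main__":
    lo, hi = -4.0, 4.0  # example activation range, chosen only for illustration
    for bits in (8, 16):
        print(bits, quantization_step(lo, hi, bits), max_rounding_error(lo, hi, bits))
```

For the same range, moving from 8 to 16 bits shrinks the step by a factor of 65535 / 255 = 257, which is why 16-bit activations can help models whose accuracy is sensitive to activation rounding.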
Conclusion
With seven models now available as open source and running with optimized operators, Cadence and Meta have demonstrated that deploying PyTorch models on DSPs can be both efficient and scalable. Continued collaboration promises even greater performance and broader applicability for embedded machine learning deployments.
Learn more about Cadence Tensilica HiFi DSPs and ExecuTorch.