Running Optimized PyTorch Models on Cadence DSPs with ExecuTorch

Introduction

Deploying PyTorch models on embedded devices, especially audio DSPs, presents unique challenges. To address these, Cadence and Meta have collaborated to create a robust, high-performance framework for deploying machine learning models on Cadence's Tensilica HiFi DSP family. By leveraging ExecuTorch and applying both graph-level and operator-level optimizations, the teams have achieved speedups of at least an order of magnitude compared to standard out-of-the-box deployments.

ExecuTorch

ExecuTorch is a solution for training and inference on the edge, designed for portability, productivity, and performance. It supports a wide variety of platforms, from mobile phones to embedded systems and microcontrollers, and enables developers to use familiar PyTorch toolchains for model authoring, conversion, debugging, and deployment. ExecuTorch provides a lightweight runtime and leverages full hardware capabilities, including CPUs, GPUs, NPUs, and DSPs.

Tensilica HiFi DSP Family

The Cadence Tensilica HiFi DSP family for audio, voice, speech, and AI offers low-energy, high-performance DSP solutions that span the entire spectrum of audio and voice algorithms and end equipment. Audio/voice/speech (AVS) processing covers a wide range of performance and power-consumption requirements. At one end of the spectrum is the ultra-low-power "wake-on-voice" processing used in many of today's smartphones and wearables. At the other end, building state-of-the-art voice-controlled digital assistants requires advanced audio digital signal processing capabilities to efficiently run neural network-based speech recognition. The Tensilica HiFi DSP family includes multiple products, ranging from the HiFi 1s DSP at the low end to the highest-performing HiFi 5s DSP.

Performance Highlights

Cadence and Meta have collaborated to improve the performance of various neural network (NN) operators on the Tensilica HiFi 4 DSP using the HiFi NN library. Measured in frames per second (FPS) at a 500 MHz clock on seven open-source models from the ExecuTorch repository, the optimized operators run at least an order of magnitude faster than standard out-of-the-box deployments:

Model                   Output Size        Base FPS @ 500 MHz   Optimized FPS @ 500 MHz
RNNT Predictor          [1, 10, 256]       146.5                2875.6
RNNT Encoder            [1, 25, 256]       5.9                  82
RNNT Joiner             [1, 25, 10, 128]   9.9                  261.1
Baby Llama (1 layer)    [1, 512]           0.5                  6.5
Resnet-18               [1, 1000]          0.2                  7.7
Resnet-50               [1, 1000]          0.1                  3.6
MobileNetv2             [1, 1000]          0.7                  12.4
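
The frame rates in the table can be turned into speedup ratios and per-inference cycle counts with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, and the numbers are copied from the table as published, so ratios for models whose base rate is given to a single significant figure (such as 0.1 or 0.2 FPS) are approximate.

```python
# Base and optimized frames per second at 500 MHz, copied from the table.
RESULTS = {
    "RNNT Predictor": (146.5, 2875.6),
    "RNNT Encoder": (5.9, 82.0),
    "RNNT Joiner": (9.9, 261.1),
    "Baby Llama (1 layer)": (0.5, 6.5),
    "Resnet-18": (0.2, 7.7),
    "Resnet-50": (0.1, 3.6),
    "MobileNetv2": (0.7, 12.4),
}

CLOCK_HZ = 500e6  # clock frequency stated in the table header


def speedup(base_fps: float, optimized_fps: float) -> float:
    """Ratio of optimized to base throughput."""
    return optimized_fps / base_fps


def cycles_per_inference(fps: float, clock_hz: float = CLOCK_HZ) -> float:
    """Clock cycles available for one inference at the given frame rate."""
    return clock_hz / fps


SPEEDUPS = {name: speedup(base, opt) for name, (base, opt) in RESULTS.items()}

if __name__ == "__main__":
    for name, (base, opt) in RESULTS.items():
        mcycles = cycles_per_inference(opt) / 1e6
        print(f"{name:22s} {SPEEDUPS[name]:6.1f}x  {mcycles:9.2f} Mcycles/inference")
```

On these figures the ratio ranges from 13x for Baby Llama to roughly 38x for Resnet-18, which is consistent with the order-of-magnitude improvement reported above.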

Operator Coverage and Data Types

ExecuTorch now supports a wide range of operators and data types that are optimized for Tensilica HiFi DSPs:

  • Compute Operators: Fully Connected, Matrix Mul, Convolution 1D/2D, Depthwise Convolution, Dilated Convolution
  • Non-linear Activations: Sigmoid, Tanh, Softmax, ReLU
  • Elementwise Operators: Add, Sub, Mul, Div, Quantize, Dequantize
  • Normalization Operators: Mean, Squared-diff, Reciprocal-square-root, Min, Max
  • Reorg Operators: Copy, Slice, Transpose, Concatenation
  • Activation Data Types: asymmetrically quantized signed int8, asymmetrically quantized unsigned int8, float32
  • Weight Data Types: symmetrically quantized int8, float32
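
The difference between the asymmetric and symmetric int8 types listed above can be sketched with the common affine quantization scheme, q = round(x / scale) + zero_point, where a symmetric quantizer fixes zero_point at 0. This is a minimal pure-Python illustration under that assumption; the function names are hypothetical and it does not reproduce the exact arithmetic of the HiFi NN library or the ExecuTorch quantizer.

```python
def quantize(x: float, scale: float, zero_point: int = 0,
             qmin: int = -128, qmax: int = 127) -> int:
    """Map a real value to an integer code, clamped to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int = 0) -> float:
    """Map an integer code back to the real value it represents."""
    return (q - zero_point) * scale


def asymmetric_params(x_min: float, x_max: float,
                      qmin: int = -128, qmax: int = 127):
    """Scale and zero point so that [x_min, x_max] spans [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero maps to an exact code.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        raise ValueError("range must have non-zero width")
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def symmetric_params(x_abs_max: float, qmax: int = 127) -> float:
    """Scale for a range centred on zero; the zero point is implicitly 0."""
    return x_abs_max / qmax


# Activations in [-1.0, 3.0]: an asymmetric range uses all 256 signed codes.
act_scale, act_zp = asymmetric_params(-1.0, 3.0)
# The same range with unsigned int8 codes only shifts the zero point.
_, act_zp_unsigned = asymmetric_params(-1.0, 3.0, qmin=0, qmax=255)
# Weights in [-0.5, 0.5]: a symmetric range needs no zero point.
w_scale = symmetric_params(0.5)
```

In this sketch the signed and unsigned activation variants share the same scale and differ only in zero point, while the symmetric weight quantizer avoids a zero-point term altogether.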

Current and Future Work

Opportunities remain to further enhance performance and expand support across different DSPs within the Tensilica HiFi family and beyond. Ongoing and future initiatives include:

  • Expanding DSP Support: Enabling additional Tensilica HiFi DSPs, such as the HiFi 1s DSP (ideal for always-on, energy-efficient applications and small NN workloads) and the HiFi 5s DSP (NN-ready, offering approximately 4x the performance of the HiFi 4 DSP)
  • Quantization Improvements: Introducing 16-bit activation support in the quantizer
  • Latency Optimizations: Investigating fused layers (e.g., LSTM, GRU) for further latency reduction
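
To give a sense of what 16-bit activation support could offer, the sketch below compares the step size of a uniform quantizer at 8 and 16 bits over the same value range. It is generic arithmetic with a hypothetical helper name, not a description of how the planned quantizer support will work.

```python
def step_size(x_min: float, x_max: float, bits: int) -> float:
    """Spacing between adjacent representable values of a uniform quantizer."""
    levels = 2 ** bits
    return (x_max - x_min) / (levels - 1)


# Same activation range, two bit widths.
INT8_STEP = step_size(-1.0, 1.0, bits=8)
INT16_STEP = step_size(-1.0, 1.0, bits=16)

# Worst-case rounding error is half a step for in-range values.
INT8_MAX_ERROR = INT8_STEP / 2
INT16_MAX_ERROR = INT16_STEP / 2

# How much finer the 16-bit grid is than the 8-bit grid.
RESOLUTION_GAIN = INT8_STEP / INT16_STEP

if __name__ == "__main__":
    print(f"int8 step:  {INT8_STEP:.6f}")
    print(f"int16 step: {INT16_STEP:.8f}")
    print(f"resolution gain: {RESOLUTION_GAIN:.0f}x")
```

The finer grid comes at the cost of two bytes per activation instead of one, so the benefit depends on how sensitive a given model is to activation precision.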

Conclusion

With seven models now available as open source and running with optimized operators, Cadence and Meta have demonstrated that deploying PyTorch models on DSPs can be both efficient and scalable. Continued collaboration promises even greater performance and broader applicability for embedded machine learning deployments.

Learn more about Cadence Tensilica HiFi DSPs and ExecuTorch.
