Running Optimized PyTorch Models on Cadence DSPs with ExecuTorch

By Vijay Pawar of Cadence and Matthias Cremon of Meta

October 22, 2025

Introduction

Deploying PyTorch models on embedded devices, especially audio DSPs, presents unique challenges. To address these, Cadence and Meta have collaborated to create a robust, high-performance framework for deploying machine learning models on Cadence's Tensilica HiFi DSP family. By leveraging ExecuTorch and applying both graph-level and operator-level optimizations, the teams have achieved speedups of at least an order of magnitude compared to standard out-of-the-box deployments.

ExecuTorch

ExecuTorch is a solution for training and inference on the edge, designed for portability, productivity, and performance. It supports a wide variety of platforms, from mobile phones to embedded systems and microcontrollers, and enables developers to use familiar PyTorch toolchains for model authoring, conversion, debugging, and deployment. ExecuTorch provides a lightweight runtime and takes full advantage of the available hardware, including CPUs, GPUs, NPUs, and DSPs.

Tensilica HiFi DSP Family

The Cadence Tensilica HiFi DSP family for audio, voice, speech, and AI offers low-energy, high-performance DSP solutions that span the entire spectrum of audio and voice algorithms and end equipment. Audio/voice/speech (AVS) processing covers a wide range of performance and power-consumption requirements. At one end of the spectrum is the ultra-low-power "wake-on-voice" processing used in many of today's smartphones and wearables. At the other end, state-of-the-art voice-controlled digital assistants require advanced audio signal processing to run neural-network-based speech recognition efficiently. The family ranges from the HiFi 1s DSP at the low end to the HiFi 5s DSP, its highest-performing member.

Performance Highlights

Cadence and Meta have collaborated to improve the performance of various neural network (NN) operators on the Tensilica HiFi 4 DSP using the HiFi NN library. Measured on seven open-source models from the ExecuTorch repository at a 500 MHz clock, the optimized builds run roughly 13x to 38x faster than standard out-of-the-box deployments:

Model	Output Size	Base FPS @ 500MHz	Optimized FPS @ 500MHz
RNNT Predictor	[1, 10, 256]	146.5	2875.6
RNNT Encoder	[1, 25, 256]	5.9	82
RNNT Joiner	[1, 25, 10, 128]	9.9	261.1
Baby Llama (1 layer)	[1, 512]	0.5	6.5
Resnet-18	[1, 1000]	0.2	7.7
Resnet-50	[1, 1000]	0.1	3.6
MobileNetv2	[1, 1000]	0.7	12.4
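
To make the table easier to compare across models, the short sketch below derives two figures from it: the speedup ratio and an approximate cycle budget per inference. It assumes that "FPS" means inferences per second and that the DSP is fully occupied by the model at 500 MHz. Both are reading assumptions, not statements from the measurements, and the helper names `speedup` and `cycles_per_inference` are made up for this illustration. Because the published base FPS values are rounded to one decimal place, the ratios are approximate.

```python
# (model, base FPS, optimized FPS) at 500 MHz, copied from the table above.
RESULTS = [
    ("RNNT Predictor", 146.5, 2875.6),
    ("RNNT Encoder", 5.9, 82.0),
    ("RNNT Joiner", 9.9, 261.1),
    ("Baby Llama (1 layer)", 0.5, 6.5),
    ("Resnet-18", 0.2, 7.7),
    ("Resnet-50", 0.1, 3.6),
    ("MobileNetv2", 0.7, 12.4),
]

CLOCK_HZ = 500e6  # clock frequency stated in the table header


def speedup(base_fps, optimized_fps):
    """Ratio of optimized to baseline throughput."""
    return optimized_fps / base_fps


def cycles_per_inference(fps, clock_hz=CLOCK_HZ):
    """Approximate DSP cycles spent per inference.

    Assumption: the DSP runs only this model, so one second of
    wall time corresponds to clock_hz cycles of model work.
    """
    return clock_hz / fps


for name, base, optimized in RESULTS:
    ratio = speedup(base, optimized)
    mcycles = cycles_per_inference(optimized) / 1e6
    print(f"{name:22s} {ratio:5.1f}x  {mcycles:9.2f} Mcycles/inference")
```

Read this way, the smallest ratio in the table belongs to Baby Llama (6.5 / 0.5 = 13x), which is consistent with the "at least an order of magnitude" claim in the introduction.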

Operator Coverage and Data Types

ExecuTorch now supports a wide range of operators and data types that are optimized for Tensilica HiFi DSPs:

Compute Operators: Fully Connected, Matrix Mul, Convolution 1D/2D, Depthwise Convolution, Dilated Convolution
Non-linear Activations: Sigmoid, Tanh, Softmax, ReLU
Elementwise Operators: Add, Sub, Mul, Div, Quantize, Dequantize
Normalization Operators: Mean, Squared-diff, Reciprocal-square-root, Min, Max
Reorg Operators: Copy, Slice, Transpose, Concatenation
Activation Data Types: asymmetrically quantized signed int8, asymmetrically quantized unsigned int8, float32
Weight Data Types: symmetrically quantized int8, float32
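
The difference between the asymmetric scheme listed for activations and the symmetric scheme listed for weights can be sketched in a few lines of plain Python. This is a textbook affine-quantization illustration, not the ExecuTorch quantizer or HiFi NN library code. The function names and the choice of per-tensor parameters are assumptions made for the example.

```python
def asymmetric_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point for asymmetric (affine) quantization.

    The range is widened to include 0.0 so that zero is exactly
    representable. Use qmin=0, qmax=255 for unsigned int8.
    """
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    span = x_max - x_min
    scale = span / (qmax - qmin) if span > 0 else 1.0
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def symmetric_params(values, qmax=127):
    """Scale for symmetric quantization; the zero point is fixed at 0."""
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return scale, 0


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer code, clamping to [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float value."""
    return (q - zero_point) * scale


# Activations after a ReLU are non-negative, so an asymmetric range
# spends all 256 codes on [0, x_max] instead of wasting half of them.
act_scale, act_zp = asymmetric_params(0.0, 2.55)
print("activation 1.0 ->", quantize(1.0, act_scale, act_zp))

# Weights are usually centred on zero, so a symmetric range with a
# zero point of 0 fits well and keeps the inner-loop arithmetic simpler.
w_scale, w_zp = symmetric_params([-0.5, 0.25, 0.1])
print("weight -0.5 ->", quantize(-0.5, w_scale, w_zp))
```

With a zero point of 0 on the weights, the weight zero-point term drops out of each multiply-accumulate. That is a common reason to pair symmetric weights with asymmetric activations.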

Current and Future Work

Opportunities remain to further enhance performance and expand support across different DSPs within the Tensilica HiFi family and beyond. Ongoing and future initiatives include:

Expanding DSP Support: Enabling additional Tensilica HiFi DSPs, such as the HiFi 1s DSP (ideal for always-on, energy-efficient applications and small NN workloads) and the HiFi 5s DSP (NN-ready, offering approximately a 4x performance boost over the HiFi 4 DSP)
Quantization Improvements: Introducing 16-bit activation support in the quantizer
Latency Optimizations: Investigating fused layers (e.g., LSTM, GRU) for further latency reduction

Conclusion

With seven models now available as open source and running with optimized operators, Cadence and Meta have demonstrated that deploying PyTorch models on DSPs can be both efficient and scalable. Continued collaboration promises even greater performance and broader applicability for embedded machine learning deployments.

Learn more about Cadence Tensilica HiFi DSPs and ExecuTorch.
