Benchmarking Ultra-Low-Power 𝜇NPUs

By Josh Millar¹, Yushan Huang¹, Sarab Sethi¹, Hamed Haddadi¹, and Anil Madhavapeddy²
¹Imperial College London
²University of Cambridge

Abstract

Efficient on-device neural network (NN) inference has various advantages over cloud-based processing, including predictable latency, enhanced privacy, greater reliability, and reduced operating costs for vendors. This has sparked the recent rapid development of microcontroller-scale NN accelerators, often referred to as neural processing units (𝜇NPUs), designed specifically for ultra-low-power applications.

In this paper we present the first comparative evaluation of a number of commercially-available 𝜇NPUs, as well as the first independent benchmarks for several of these platforms. We develop and open-source a model compilation framework to enable consistent benchmarking of quantized models across diverse 𝜇NPU hardware. Our benchmark targets end-to-end performance and includes model inference latency, power consumption, and memory overhead, alongside other factors. The resulting analysis uncovers both expected performance trends and surprising disparities between hardware specifications and actual performance, including 𝜇NPUs exhibiting unexpected scaling behaviors with increasing model complexity. Our framework provides a foundation for further evaluation of 𝜇NPU platforms, alongside valuable insights for both hardware designers and software developers in this rapidly evolving space.

1. INTRODUCTION

Performing neural network (NN) inference on constrained devices has applications across numerous domains, including wearable health monitoring [1], smart agriculture [2], real-time audio processing [3], and predictive maintenance [4]. On-device inference offers various advantages over cloud-based alternatives: improved latency for time-critical applications; enhanced privacy, by eliminating the need to transmit sensitive data; reduced operating costs for vendors; and improved reliability, by removing dependence on network connectivity. Given their unique form factor and low power consumption, microcontrollers (MCUs) are widely used in resource-constrained environments. However, their performance is often constrained by limitations in memory capacity, throughput, and compute.

The computational demands of modern neural networks (NNs) have catalyzed the development of specialized hardware accelerators across the computing spectrum, from high-performance data centers to ultra-low-power and embedded devices. At the resource-constrained end of the spectrum, microcontroller-scale neural processing units (𝜇NPUs) have recently emerged, designed to operate within extremely tight power envelopes — in the milliwatt or sub-milliwatt range — while still providing low latency for real-time inference. These devices represent a new class of accelerator, combining the power efficiency of MCUs with the cognitive capabilities previously exclusive to more powerful computing platforms. The core advantage of 𝜇NPUs stems from their ability to exploit the inherent parallelism of neural networks with dedicated multiply-accumulate (MAC) arrays alongside specialized memory structures for weight storage. Such architectural specialization enables 𝜇NPUs to achieve orders-of-magnitude improvements in latency compared to general-purpose MCUs executing equivalent workloads.

Despite the growing availability of 𝜇NPU platforms, the field lacks a standardized evaluation or comprehensive benchmark suite. Existing benchmarks focus solely on Analog Devices’ MAX78000 [5–7], lacking any side-by-side comparison with other platforms. Hardware vendors provide performance metrics based on proprietary evaluation frameworks, often using disparate NN models, quantization strategies, and other varying optimizations. This heterogeneity across evaluation methods, and absence of independent verification of vendor-provided performance claims, creates uncertainty for hardware designers and embedded software developers in selecting the most suitable 𝜇NPU platform for their application’s constraints. The lack of standardized benchmarking also hampers research by obscuring the relationship between architectural design and real-world performance.

Given the rapid pace of development and increasing diversity of available 𝜇NPU platforms, establishing reliable comparative benchmarks has become an urgent need for the field. To this end, we make the following contributions:

  • Side-by-Side Benchmark of 𝜇NPU Platforms: We conduct the first comparative evaluation of commercially-available 𝜇NPU platforms, enabling direct performance comparisons across diverse hardware architectures under consistent workloads and measurement conditions.
  • Independent Benchmarks: We also provide the first fine-grained and independent performance benchmarks for several 𝜇NPU platforms that have not previously been subject to third-party evaluation, offering unbiased verification of vendor performance claims.
  • Open-Source Model Compilation Framework: We develop and release an open-source framework that enables consistent and simplified transplanting of NN models across diverse 𝜇NPU platforms, reducing the engineering overhead associated with cross-platform evaluation.
  • Developer Recommendations: Informed by our benchmark results, we provide actionable recommendations to developers regarding platform selection, key focus areas for model optimization, and trade-offs for various application scenarios and constraints.

In developing a unified compilation and benchmarking framework, we standardize model representations across the various 𝜇NPU platforms, enabling direct comparison of latency, memory, and energy performance. Our evaluation also includes fine-grained analysis of the various model execution stages, from NPU initialization and memory input/output overheads to CPU pre/post-processing — aspects that can significantly impact end-to-end performance but are often not addressed in technical evaluations. The resulting analysis uncovers both expected performance trends and surprising disparities between hardware specifications and actual performance, including 𝜇NPUs exhibiting unexpected scaling behaviors with increasing model complexity. We hope our findings provide valuable insights to both developers and hardware architects.
