Neural network model quantization on mobile
The general definition of quantization is the process of mapping continuous, infinite values to a smaller set of discrete, finite values. In this blog, we discuss quantization in the context of Neural Network (NN) models: the process of reducing the precision of the weights, biases, and activations. Moving from floating-point representations to low-precision fixed-point integer values can substantially reduce both memory footprint and latency. This is crucial for deploying models on mobile devices and edge platforms, where runtime computational resources are restricted. Quantization has also gained renewed importance with the latest developments in generative and Large Language Models (LLMs), and the need to bring them to the mobile space.
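To make the mapping concrete, here is a minimal sketch of affine (asymmetric) int8 quantization in plain NumPy: a float tensor's observed range is mapped onto the 256 int8 levels via a scale and a zero-point. The function names and the toy weight tensor are illustrative, not part of any particular framework's API.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float tensor to int8."""
    # The scale maps the observed float range onto the 256 int8 levels.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    # The zero-point is the integer that represents the real value 0.0.
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)

q, s, z = quantize_int8(weights)
# Round-trip error is bounded by the scale (i.e. the quantization step).
error = float(np.abs(dequantize(q, s, z) - weights).max())
```

The int8 tensor plus one scale and one zero-point replace sixteen float32 values, which is where the 4x memory saving comes from.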
This blog intends to provide a picture of the current state of quantization on mobile (Android) and the opportunities it opens for bringing inference of complex NN models to the edge. The first section provides an overview of existing quantization methods and classifications. The second section discusses and compares the two main quantization approaches in TensorFlow Lite (TFLite): Post-Training Quantization (PTQ) and Quantization Aware Training (QAT). Due to the increasing importance of LLMs and generative models, the last section is devoted to some of the challenges of Transformer models, where mixed-precision quantization is the preferred approach.
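One way to see why Transformer models push toward mixed precision is the outlier problem: a few very large activation values inflate a per-tensor scale, so almost all other values collapse onto a handful of int8 levels. The toy experiment below (plain NumPy, not TFLite code) measures the round-trip error of symmetric per-tensor int8 quantization with and without a single injected outlier; the magnitude 100.0 is an arbitrary choice for illustration.

```python
import numpy as np

def int8_roundtrip_error(x: np.ndarray) -> float:
    """Mean absolute error after a symmetric per-tensor int8 round-trip."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.abs(q * scale - x).mean())

rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000).astype(np.float32)

well_behaved = int8_roundtrip_error(acts)
# A single large outlier (as observed in some Transformer activations)
# inflates the scale, degrading precision for every other value.
with_outlier = int8_roundtrip_error(np.append(acts, np.float32(100.0)))
```

Mixed-precision schemes sidestep this by keeping the outlier-prone tensors (or channels) in a floating-point or wider integer format while quantizing the rest to int8.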