High-Fidelity Conversion of Floating-Point Networks for Low-Precision Inference using Distillation
A major challenge for fast, low-power inference on neural network accelerators is model size. Recent years have seen a trend toward deeper neural networks with more activations and coefficients, and with this rise in model size comes a corresponding rise in inference time and energy consumption per inference. This is particularly significant in resource-constrained mobile and automotive applications. Low-precision inference reduces inference cost by reducing DRAM bandwidth (a significant contributor to device energy consumption), compute logic cost, and power consumption.
In this context, a natural question arises: what is the optimal bit depth at which to encode neural network weights and activations? Several number formats that reduce bit depth have been proposed, including Nvidia's TensorFloat, Google's 8-bit asymmetric fixed point (Q8A), and bfloat16. While these are a step in the right direction, they cannot be said to be optimal: for example, most of them store an exponent for each individual value, which is redundant when multiple values share the same range. More importantly, they do not account for the fact that different parts of a neural network often have different bit-depth requirements. Some layers can be encoded at a low bit depth, while others (such as input and output layers) need a higher one. An example of this is MobileNet v3, which can be converted from 32-bit floating point to bit depths mostly in the 5-12 bit range (see Figure 1).
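To make the idea concrete, here is a minimal sketch of Q8A-style asymmetric fixed-point quantization, generalized to an arbitrary per-layer bit depth. The function names and the toy experiment are illustrative, not the article's implementation; the point it demonstrates is that a single shared scale and zero point per tensor replaces a per-value exponent, and that the reconstruction error shrinks as the bit depth grows.

```python
import numpy as np

def quantize_asymmetric(x, bits):
    """Asymmetric fixed-point quantization with one shared scale and
    zero point per tensor (Q8A-style, at a configurable bit depth).
    Hypothetical helper for illustration."""
    qmin, qmax = 0, 2**bits - 1
    lo, hi = float(x.min()), float(x.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # make exact zero representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a constant tensor
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to floating point."""
    return scale * (q.astype(np.float32) - zero_point)

# Different layers tolerate different bit depths: a middle layer may
# survive 5 bits, while input/output layers may need closer to 12.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)
for bits in (5, 8, 12):
    q, s, zp = quantize_asymmetric(weights, bits)
    err = np.abs(dequantize(q, s, zp) - weights).max()
    print(f"{bits:2d} bits -> max abs reconstruction error {err:.4f}")
```

Running this shows the accuracy/bit-depth trade-off directly: each extra bit roughly halves the worst-case quantization error, which is why choosing the bit depth per layer, rather than globally, can save storage and bandwidth where a layer's weights are tolerant of coarser encoding.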