High-Fidelity Conversion of Floating-Point Networks for Low-Precision Inference using Distillation
A major challenge for fast, low-power inference on neural network accelerators is model size. Recent years have seen a trend towards deeper neural networks with more activations and coefficients, and with this rise in model size comes a corresponding rise in inference time and energy consumption per inference. This is particularly significant in resource-constrained mobile and automotive applications. Low-precision inference helps reduce inference cost by reducing DRAM bandwidth (a significant contributor to device energy consumption), compute logic cost, and power consumption.
In this context, a natural question arises: what is the optimal bit depth at which to encode neural network weights and activations? Several number formats have been proposed that reduce the bit depth, including Nvidia's TensorFloat, Google's 8-bit asymmetric fixed point (Q8A), and bfloat16. While these formats are a step in the right direction, they are not optimal: for example, most of them store an exponent for each individual value, which is redundant when multiple values share the same range. More importantly, they do not account for the fact that different parts of a neural network often have different bit-depth requirements. Some layers can be encoded with a low bit depth, while others (such as input and output layers) need a higher bit depth. MobileNet v3 is one example: it can be converted from 32-bit floating point to bit depths mostly in the 5 to 12 bit range (see Figure 1).
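To make the per-layer idea concrete, the sketch below (a rough illustration, not the method from the full article) quantizes a layer's weights with a single scale shared across the whole tensor, rather than a per-value exponent, and searches for the smallest bit depth whose reconstruction error stays below a tolerance. The function names, the candidate range of 4 to 12 bits, and the error threshold are illustrative assumptions.

```python
# Minimal sketch, assuming symmetric uniform quantization with one shared
# scale per tensor and a simple error-tolerance search over bit depths.
# Not the article's actual conversion method; names and thresholds are
# illustrative.
import numpy as np

def quantize_shared_scale(weights: np.ndarray, bits: int) -> np.ndarray:
    """Quantize to signed integers using one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax      # one shared range, not per-value
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q * scale                            # dequantize to measure error

def smallest_sufficient_bit_depth(weights: np.ndarray,
                                  tolerance: float = 1e-3,
                                  candidates=range(4, 13)) -> int:
    """Return the lowest candidate bit depth whose relative error is below tolerance."""
    for bits in candidates:
        reconstructed = quantize_shared_scale(weights, bits)
        rel_err = np.linalg.norm(weights - reconstructed) / np.linalg.norm(weights)
        if rel_err < tolerance:
            return bits
    return max(candidates)                      # fall back to the widest candidate

# Example: layers with different weight distributions end up at different bit depths.
rng = np.random.default_rng(0)
layers = {
    "conv1": rng.normal(0.0, 0.5, size=(64, 3, 3, 3)),
    "fc":    rng.normal(0.0, 0.02, size=(1000, 512)),
}
for name, w in layers.items():
    print(name, smallest_sufficient_bit_depth(w))
```

Running a search like this per layer illustrates the point above: layers whose weights occupy a narrow, well-behaved range satisfy the tolerance at fewer bits than layers with a wide dynamic range, so a single network-wide bit depth is rarely the most efficient choice.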