Designed from the ground up to address significant machine learning (ML) inference deployment challenges facing system on chip (SoC) developers, the Chimera™ General Purpose Neural Processing Unit (GPNPU) family has a simple yet powerful architecture with demonstrated improved matrix-computation performance over the traditional approach. Its crucial differentiation is its ability to execute diverse workloads with great flexibility all in a single processor scaling from 1 TOP to 864 TOPs.
The Chimera GPNPU family provides a unified processor architecture that can handle matrix and vector operations and scalar (control) code in one execution pipeline. These workloads are traditionally handled separately by an NPU, DSP, and realtime CPU. The Chimera GPNPU is entirely driven by code, empowering developers to continuously optimize the performance of their models and algorithms throughout the device’s lifecycle.
Modern System-on-Chip (SoC) architectures deploy complex algorithms that mix traditional C++ based code with newly emerging and fast-changing machine learning (ML) inference code. This combination of graph code commingled with C++ code is found in numerous chip subsystems, most prominently in vision and imaging subsystems, radar and lidar processing, communications baseband subsystems, and a variety of other data- rich processing pipelines. Only Quadric’s Chimera GPNPU architecture can deliver high ML inference performance and run complex, data-parallel C++ code on the same fully programmable processor.
Compared to other ML inference architectures that force the software developer to artificially partition an algorithm solution between two or three different kinds of processors, Quadric’s Chimera processors deliver a massive uplift in software developer productivity while also providing current-day graph processing efficiency coupled with long-term future-proof flexibility.
The Chimera GPNPUs are licensable processor IP cores delivered in synthesizable source RTL form. Blending the best attributes of both neural processing units (NPUs) and digital signal processors (DSPs), Chimera GPNPUs are aimed at inference applications in a variety of high-volume end applications including mobile devices, digital home applications, automotive and network edge compute systems.
General Purpose Neural Processing Unit (NPU)
Overview
Key Features
- System Simplicity
- The solution enables hardware developers to instantiate a single core that can handle an entire ML workload plus the typical digital signal processor functions and signal conditioning workloads often intermixed with ML inference functions. Dealing with a single core drastically simplifies hardware integration and eases performance optimization. System design tasks such as profiling memory usage to ensure sufficient off-chip bandwidth are greatly simplified.
- Programming Simplicity
- the GPNPU architecture dramatically simplifies software development since matrix, vector, and control code can all be handled in a single code stream. ML graph code from the common training toolsets (Tensorflow, Pytorch, ONNX formats) is compiled by the Quadric toolset and can be merged with signal processing code written in C++, all compiled into a single code stream running on a single processor core.
- the toolset meets the demands of both hardware and software developers, who no longer need to master multiple toolsets from multiple vendors. The entire subsystem can be debugged in a single debug console. This can dramatically reduce code development time and ease performance optimization.
- This new programming paradigm also benefits the end users of the SoCs since they will have access to program all the GPNPU resources.
- Future Proof Flexibility
- A GPNPU can run anything written in C++. This is incredibly powerful since SoC developers can quickly write code to implement new neural network operators and libraries long after the SoC has been taped out. This eliminates fear of the unknown and dramatically increases a chip’s useful life.
- Again, this flexibility is extended to the end users of the SoCs. They can continuously add new features to the end products, giving them a competitive edge.
- Replacing the heterogenous ML subsystem comprised of separate NPU, DSP, and realtime CPU cores with one GPNPU has obvious advantages. By allowing vector, matrix, and control code to be handled in a single code stream, the development and debug process is greatly simplified while the ability to add new algorithms efficiently is greatly enhanced.
- As ML models continue to evolve and inferencing becomes prevalent in even more applications, the pay off from this unified architecture helps future proof chip design cycles.
Benefits
- Hybrid Von Neuman + 2D SIMD matrix architecture
- 64b Instruction word, single instruction issue per clock
- 7-stage, in-order pipeline
- Scalar / vector / matrix instructions modelessly intermixed with granular predication
- Deterministic, non-speculative execution delivers predictable performance levels
- AXI Interfaces to system memory (independent data and instruction access)
- Instruction cache
- Distributed tightly coupled local register memories (LRM) with data broadcast networks within matrix array allows overlapped compute and data movement to maximize performance
- Local L2 data memory (multi-bank, configurable 1MB to 16MB) minimizes off-chip DDR access, lowering power dissipation
- Optimized for INT8 machine learning inference (with optional FP16 support) plus 32b DSP ops
- Compiler-driven, fine-grained clock gating delivers power savings
Block Diagram
Technical Specifications
Related IPs
- ARC NPX Neural Processing Unit (NPU) IP supports the latest, most complex neural network models and addresses demands for real-time compute with ultra-low power consumption for AI applications
- Enhanced Neural Processing Unit for safety providing 32,768 MACs/cycle of performance for AI applications
- Enhanced Neural Processing Unit for safety providing 49,152 MACs/cycle of performance for AI applications
- Enhanced Neural Processing Unit providing 1024 MACs/cycle of performance for AI applications
- Enhanced Neural Processing Unit providing 12,288 MACs/cycle of performance for AI applications
- Enhanced Neural Processing Unit providing 16,384 MACs/cycle of performance for AI applications