General Purpose Neural Processing Unit (NPU)

Overview

Designed from the ground up to address significant machine learning (ML) inference deployment challenges facing system on chip (SoC) developers, the Chimera(TM) general purpose neural processor (GPNPU) family has a simple yet powerful architecture with demonstrated improved matrix-computation performance over the traditional approach. Its crucial differentiation is its ability to execute diverse workloads with great flexibility all in a single processor.

The Chimera GPNPU family provides a unified processor architecture that can handle matrix and vector operations and scalar (control) code in one execution pipeline. In conventional SoC architectures these workloads are traditionally handled separately by an NPU, DSP, and realtime CPU, requiring splitting code and tuning performance across two or three heterogenous cores. The Chimera GPNPU is a single software-controlled core, allowing for simple expression of complex parallel workloads.

The Chimera GPNPU is entirely driven by code, empowering developers to continuously optimize the performance of their models and algorithms throughout the device’s lifecycle. That's why it's ideal to run classic backbone networks, todays' newest Vision Transformers and Large Language Models, and whatever new networks are invented tomorrow.

Modern System-on-Chip (SoC) architectures deploy complex algorithms that mix traditional C++ based code with newly emerging and fast-changing machine learning (ML) inference code. This combination of graph code commingled with C++ code is found in numerous chip subsystems, most prominently in vision and imaging subsystems, radar and lidar processing, communications baseband subsystems, and a variety of other data- rich processing pipelines. Only Quadric’s Chimera GPNPU architecture can deliver high ML inference performance and run complex, data-parallel C++ code on the same fully programmable processor.

Compared to other ML inference architectures that force the software developer to artificially partition an algorithm solution between two or three different kinds of processors, Quadric’s Chimera processors deliver a massive uplift in software developer productivity while also providing current-day graph processing efficiency coupled with long-term future-proof flexibility.

Quadric’s Chimera GPNPUs are licensable processor IP cores delivered in synthesizable source RTL form. Blending the best attributes of both neural processing units (NPUs) and digital signal processors (DSPs), Chimera GPNPUs are aimed at inference applications in a variety of high-volume end applications including mobile devices, digital home applications, automotive and network edge compute systems.

Key Features

Hybrid Von Neuman + 2D SIMD matrix architecture
64b Instruction word, single instruction issue per clock
7-stage, in-order pipeline
Scalar / vector / matrix instructions modelessly intermixed with granular predication
Deterministic, non-speculative execution delivers predictable performance levels
AXI Interfaces to system memory (independent data and instruction access)
Instruction cache
Distributed tightly coupled local register memories (LRM) with data broadcast networks within matrix array allows overlapped compute and data movement to maximize performance
Local L2 data memory (multi-bank, configurable 1MB to 16MB) minimizes off-chip DDR access, lowering power dissipation
Optimized for INT8 machine learning inference (with optional FP16 support) plus 32b DSP ops
Compiler-driven, fine-grained clock gating delivers power savings

Benefits

Hybrid Von Neuman + 2D SIMD matrix architecture
64b Instruction word, single instruction issue per clock
7-stage, in-order pipeline
Scalar / vector / matrix instructions modelessly intermixed with granular predication
Deterministic, non-speculative execution delivers predictable performance levels
AXI Interfaces to system memory (independent data and instruction access)
Instruction cache
Distributed tightly coupled local register memories (LRM) with data broadcast networks within matrix array allows overlapped compute and data movement to maximize performance
Local L2 data memory (multi-bank, configurable 1MB to 16MB) minimizes off-chip DDR access, lowering power dissipation
Optimized for INT8 machine learning inference (with optional FP16 support) plus 32b DSP ops
Compiler-driven, fine-grained clock gating delivers power savings

Block Diagram

Technical Specifications

Short description

General Purpose Neural Processing Unit (NPU)

Vendor

Vendor Name

Request Info

General Purpose Neural Processing Unit (NPU)

Overview

Key Features

Benefits

Block Diagram

Technical Specifications

Related IPs