Enhanced Neural Processing Unit providing 1024 MACs/cycle of performance for AI applications

Overview

The ARC® NPX Neural Processor IP family provides a high-performance, power- and area-efficient IP solution for a range of applications requiring AI-enabled SoCs. The ARC NPX6 NPU IP is designed for broad deep learning algorithm coverage, including computer vision tasks such as object detection, image quality improvement, and scene segmentation, as well as broader AI applications such as generative AI.

The NPX6 NPU family offers multiple products so you can match your specific application requirements. The architecture is built from individual cores and scales from 1K MACs to 96K MACs, giving a single AI engine over 250 TOPS of performance (over 440 TOPS with sparsity). The NPX6 NPU IP includes hardware and software support for multi-NPU clusters of up to 8 NPUs, reaching 3,500 TOPS with sparsity. Advanced bandwidth features in hardware and software, together with a memory hierarchy (L1 memory in each core plus a high-performance, low-latency interconnect to a shared L2 memory), make scaling to high MAC counts possible. An optional tensor floating-point unit is available for applications that benefit from BF16 or FP16 inside the neural network.
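The MAC and TOPS figures above are related by simple arithmetic. The sketch below assumes the common industry convention that one MAC counts as two operations (a multiply and an add), and uses the 4K MAC/core, 1 to 24 core configuration listed under Key Features. The clock frequency is a hypothetical value chosen only to illustrate the formula; the real clock depends on process and implementation, and the sparsity gain is not modeled. The function names are illustrative, not part of any Synopsys API.

```python
# Peak-throughput arithmetic for the NPX6 configurations described on this page.
# Assumptions (not stated by the product page): 1 MAC = 2 operations, and the
# clock frequency passed to peak_dense_tops() is hypothetical.

MACS_PER_LARGE_CORE = 4096  # "enhanced 4K MAC/core convolution accelerator"
MAX_LARGE_CORES = 24        # "1-24 cores"


def npu_mac_count(large_cores: int) -> int:
    """MACs per cycle for one NPU built from `large_cores` 4K-MAC cores."""
    if not 1 <= large_cores <= MAX_LARGE_CORES:
        raise ValueError(f"expected 1..{MAX_LARGE_CORES} cores, got {large_cores}")
    return large_cores * MACS_PER_LARGE_CORE


def peak_dense_tops(macs_per_cycle: int, clock_hz: float) -> float:
    """Dense peak TOPS = 2 ops/MAC x MACs/cycle x cycles/s, scaled to 10^12."""
    return 2 * macs_per_cycle * clock_hz / 1e12


if __name__ == "__main__":
    max_macs = npu_mac_count(MAX_LARGE_CORES)
    print(max_macs)  # 98304 MACs, i.e. the "96K MACs" upper bound
    # Hypothetical 1.3 GHz clock, used only to show the formula.
    print(round(peak_dense_tops(max_macs, 1.3e9), 1))  # 255.6
    # Cluster scaling multiplies the page's own per-NPU sparse figure.
    print(8 * 440)  # 3520, consistent with the ~3,500 TOPS cluster claim
```

With this hypothetical clock the dense estimate lands just above the quoted 250 TOPS, which shows the formula is consistent with the page's numbers without implying anything about the actual operating frequency.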

To speed application software development, the ARC NPX6 NPU Processor IP is supported by the MetaWare MX Development Toolkit, a comprehensive software programming environment that includes a neural network Software Development Kit (NN SDK) and support for virtual models. The NN SDK automatically converts neural networks trained in popular frameworks such as PyTorch and TensorFlow, or exchanged in the ONNX format, into optimized executable code for the NPX hardware.

The NPX6 NPU Processor IP can be used to create a range of products – from a few TOPS to thousands of TOPS – that can all be programmed with a single toolchain.

Key Features

  • Scalable real-time AI / neural processor IP with up to 3,500 TOPS performance (with sparsity)
  • Supports CNNs, transformers (including generative AI), recommender networks, RNNs/LSTMs, and more
  • Industry-leading power efficiency (up to 30 TOPS/W)
  • A single 1K MAC core, or 1–24 cores of an enhanced 4K MAC/core convolution accelerator
  • Tensor accelerator providing flexible activation and support of Tensor Operator Set Architecture (TOSA)
  • Software Development Kit
    • Automatic mixed-mode quantization tools
  • Bandwidth reduction through architecture and software tool features
  • Latency reduction through parallel processing of individual layers
  • Seamless integration with Synopsys ARC VPX vector DSPs
  • High-productivity MetaWare MX Development Toolkit supports the TensorFlow and PyTorch frameworks and the ONNX exchange format

Block Diagram

[Figure: block diagram of the Enhanced Neural Processing Unit (1024 MACs/cycle)]
