Partitioning Strategies to Optimize AI Inference for Multi-Core Platforms
Not so long ago, AI inference at the edge was a novelty easily supported by a single NPU IP accelerator embedded in the edge device. Expectations have accelerated rapidly since then. Today we want embedded AI inference to handle multiple cameras, complex scene segmentation, voice recognition with intelligent noise suppression, fusion between multiple sensors, and even very large and complex generative AI models. Such applications can deliver acceptable throughput for edge products only when run on multi-core AI processors. NPU IP accelerators are already available to meet this need, scaling to eight or more parallel cores and able to handle multiple inference tasks in parallel. But how should you partition your product's expected AI inference workloads to take maximum advantage of all that horsepower? That is the subject of this article.
Six paths to exploit parallelism for AI inference
As in any parallelism problem, we start with a defined set of resources for our AI inference objective: some number of available accelerators, each with local L1 cache, plus shared L2 cache and a DDR interface, all with defined buffer sizes. The task is then to map the network graphs required by the application onto that structure, balancing total throughput against resource utilization.
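As a rough illustration of that mapping problem, the following Python sketch models a platform of NPU engines with local L1 buffers and shared L2/DDR resources, then assigns network subgraphs to engines with a simple greedy load-balancing heuristic. The class names, buffer sizes, and heuristic are assumptions made for the sketch, not the behavior of any particular NPU compiler or runtime.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    """One NPU engine with its private L1 buffer (sizes are illustrative assumptions)."""
    core_id: int
    l1_bytes: int = 1 * 1024 * 1024          # assumed 1 MB local L1
    assigned: list = field(default_factory=list)
    load_macs: int = 0                        # accumulated compute load (arbitrary units)

@dataclass
class Platform:
    """Resources shared by all engines (values are illustrative assumptions)."""
    cores: list
    l2_bytes: int = 8 * 1024 * 1024           # assumed shared L2 buffer
    ddr_gbps: float = 51.2                    # assumed DDR interface bandwidth

def map_subgraphs(platform, subgraphs):
    """Greedy mapping: place each subgraph (name, MAC count) on the least-loaded engine.
    A production compiler would also weigh L1/L2 residency and DDR traffic, not just compute."""
    for name, macs in sorted(subgraphs, key=lambda s: -s[1]):
        core = min(platform.cores, key=lambda c: c.load_macs)
        core.assigned.append(name)
        core.load_macs += macs
    return {c.core_id: c.assigned for c in platform.cores}

if __name__ == "__main__":
    plat = Platform(cores=[Core(i) for i in range(4)])
    graphs = [("detector", 900), ("segmenter", 1200), ("denoiser", 300), ("keyword", 150)]
    print(map_subgraphs(plat, graphs))
```

Even this toy heuristic shows the trade-off the article describes: packing work onto fewer engines raises utilization per engine, while spreading it evenly raises total throughput.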
One obvious strategy applies when large input images must be split into multiple tiles: partitioning by input map, where each engine is allocated a tile. Here, multiple engines search the input map in parallel, looking for the same feature. Conversely, you can partition by output map: the same tile is fed into multiple engines in parallel, and each engine uses the same model but different weights, so different features are detected in the input image at the same time.
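To make the two schemes concrete, here is a minimal NumPy sketch of both partitionings. The shapes, engine count, and function names are illustrative assumptions; a real NPU runtime would also handle tile halos, weight layout, and result gathering.

```python
import numpy as np

def partition_by_input_map(image, n_engines):
    """Split one large input map into row-band tiles, one tile per engine.
    Each engine runs the same weights on its own tile (halo handling omitted)."""
    return np.array_split(image, n_engines, axis=0)

def partition_by_output_map(weights, n_engines):
    """Split the output channels (filters) across engines.
    Every engine sees the same input tile but applies a different slice of the weights,
    so each engine detects a different subset of features."""
    return np.array_split(weights, n_engines, axis=0)

if __name__ == "__main__":
    image = np.random.rand(1024, 1024)           # one large input feature map
    weights = np.random.rand(64, 3, 3)           # 64 filters of a conv layer
    tiles = partition_by_input_map(image, 4)
    filter_groups = partition_by_output_map(weights, 4)
    print([t.shape for t in tiles])              # 4 row bands of 256 rows each
    print([w.shape[0] for w in filter_groups])   # 16 filters per engine
```

Input-map partitioning keeps the full weight set resident on every engine, while output-map partitioning shares the input activations and divides the weights, so the better choice depends on whether activations or weights dominate the local buffer budget.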