Scaling On-Device GPU Inference for Large Generative Models

By Jiuqiang Tang ¹, Raman Sarokin ¹, Ekaterina Ignasheva ², Grant Jensen ¹, Lin Chen ¹, Juhyun Lee ¹, Andrei Kulik ¹, Matthias Grundmann ¹

¹Google LLC

²Meta Platforms, Inc.

Driven by advances in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the need for on-device inference persists, motivated by privacy and efficiency considerations. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift, an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads that contain 10 to 100x more parameters than existing on-device generative AI models. It addresses intricate engineering challenges associated with cross-GPU API development and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.


