Your NPU Learned to Listen
Whisper runs end-to-end on Chimera. Encoder and decoder compiled as native GPNPU kernels, INT4 weights, FP16 attention, top-1 token match against the float32 reference. Scales to four cores with a flag, no recompile.
I ran a LibriSpeech clip through Whisper-tiny, encoder and decoder, INT4 weights with FP16 attention, compiled as native GPNPU kernels. The decoder's first-token prediction came out identical to the float32 reference: same top-1, same top-2, same vocabulary distribution. The chip ran the model and the model behaved.
The key is that this wasn't a GPU or a cloud endpoint. This was our general-purpose NPU running OpenAI's Whisper, with both transformer stages on the same silicon and compiled through the same toolchain, and no CPU fallback for the decoder.
That's new for us. Chimera has run vision models since day one; it ran DOOM to prove a serious point about our programmability. But Speech is a different modality, and Whisper is a different class of model, and getting it right meant solving problems that fixed-function accelerators haven't touched.
Why Whisper is hard for NPUs
Whisper isn't a single neural network, it's a pipeline:
- Audio enters as a raw waveform
- A feature extractor converts the waveform to an 80-channel log-mel spectrogram
- A convolutional front-end processes the spectrogram into post-conv features
- An encoder — four transformer layers with self-attention — compresses those features into hidden states
- A decoder — four transformer layers with self-attention and cross-attention to the encoder — auto-regressively generates text tokens
The encoder is a standard transformer: matrix multiplies, layer norms, self-attention. The operations map cleanly to a systolic array. That's the part that shows up in marketing benchmarks.
But the decoder is where pipelines break.
Cross-attention means the decoder's key and value projections come from the encoder's output, not from the decoder's own hidden states. Each decoder layer has causal self-attention (masked, sequential) and cross-attention (attending to all 1,500 encoder positions); that's eight weight-matrix projections per layer instead of four. The compute pattern is fundamentally different from the encoder.
Then there's quantization. Whisper's weights use MatMulNBits, a 4-bit integer format with per-group scaling factors, while the attention computation runs in FP16. In a single forward pass, the hardware is switching between INT4 weight lookups, FP16 softmax, and mixed-precision accumulation. INT8-only silicon doesn't have the registers for this, and INT8 silicon with an "FP16 mode" doesn't have the bandwidth for this.
On accelerators without general-purpose programmability, the encoder runs on the NPU but the decoder falls back to the CPU and the pipeline stalls at the handoff. You've accelerated half the model and decelerated the system in the process.
Marketing focuses on the encoder because the encoder isn't the hard part; the decoder is, and the decoder doesn't care about your marketing slides.
Explore NPU IP:
- GPNPU Processor IP - 32 to 864TOPs
- GPNPU Processor IP - 16 to 108 TOPs
- GPNPU Processor IP - 4 to 28 TOPs
What we built
We compiled the Whisper-tiny transformer, encoder and decoder, as native Chimera GPNPU kernels using Quadric's Chimera SDK. Both stages compile through the same toolchain and run on the same silicon.
| Stage | Runs on | Shape |
|---|---|---|
| Mel spectrogram | host CPU | [1, 80, 3000] |
| Conv front-end | host CPU (ORT) | [1, 1500, 384] post-conv features |
| Encoder (4 layers) | Chimera GPNPU | [1, 1500, 384] hidden states |
| Decoder prefill (4 layers) | Chimera GPNPU | hidden states + 4 prompt tokens → [1, 4, 51865] logits |
Why CPU for Mel and Conv?
That's a fair question. The mel spectrogram is FFT-heavy DSP (a few milliseconds of math on commodity x86) and the convolutional front-end is two small Conv1Ds plus positional embedding (under one percent of the model's FLOPs); together they're noise next to the 39M-parameter transformer that follows.
Chimera can run them. Convs are native ops on the GPNPU, and the 26.05 release adds depthwise and grouped FP16 variants. The DSP isn't theoretical either: the same 26.05 release ships an ASVspoof2021 anti-spoofing notebook where Hamming-windowed STFT, mel filterbank, log, DCT, and delta/delta-delta features all run on Chimera silicon via ChiPy as one CGC-compiled program. It can do raw waveform in and classifier output out with no host hop: the building blocks for an end-to-end audio pipeline on Chimera are already shipping.
They aren't in this demo because the SDK cuts the Whisper ONNX graph at the post-conv edge /Add_2_output_0 and ships the transformer block to CGC. That's where the FLOPs are, where the IP is, and where the architecture has to do work no fixed-function NPU can do: INT4 weight lookups feeding FP16 softmax feeding cross-attention feeding mixed-precision accumulation.
When the Whisper-specific front-end gets composed into a ChiPy pipeline alongside the encoder and decoder, the whole thing runs on Chimera. Quadric's ASVspoof pipeline shows that's a porting exercise, not a capability gap.
The SDK workflow
The Chimera SDK's ChimeraJob handles the full compilation pipeline:
- Start with an ONNX model from Hugging Face
- The SDK's helper functions replace
MatMulNBitssubgraphs and attention cores with Quadric custom ops (linalg::matMulNBits,nn::whisperAttention) - A calibration step runs 73 real audio samples through the model to generate tensor ranges, the per-tensor min/max values that determine fixed-point fractional bits for quantization.
Then ChimeraJob.compile() takes the custom-op-matched ONNX graph through relay conversion, quantization, and GPNPU code generation. The encoder compiles to a single GPNPU kernel, the decoder compiles to a single GPNPU kernel, and the encoder's ISS output feeds directly into the decoder's input with no intermediate dequantization and no float-reference shortcut.
hw_config = HWConfig(product="QC-P", ocm_size="32MB", macs_per_pe=16, num_cores=1)
encoder_job = ChimeraJob(
"encoder_model_matched_cut.onnx",
hw_config=hw_config,
trange_file="encoder_model.tranges",
attn_stub_src_path="whisper_matmul_stubs.hpp",
)
encoder_job.compile()
ONNX model in, ISS-verified kernel out, and the decoder follows the same pattern.
Scaling to four cores with a flag
The same wrapper builds on 1, 2, or 4 cores via sdk source --num-cores N with no code change in the kernel or recompile of the CGC artifact; the wrapper picks up the configured core count at build time and the runtime splits per-layer work across cores.
sdk source whisper_encoder.cpp --num-cores 4 --target QC-P \
--ocm-size 32MB --macs-per-pe 16 --clock-freq-ghz 1.7
That's the part that gets cheap when the underlying hardware is general-purpose: the kernel doesn't know whether it's running on one core or four, but multi-core scheduling is a CGC concern, not a kernel-author concern.
The proof
The encoder runs on the GPNPU under the ISS (cycle-accurate functional simulator) and produces hidden states with PSNR 25.81 dB against the float32 ONNX Runtime reference, mean absolute error 1.26. The decoder is then fed the encoder's quantized output, not a float reference, and produces logits with PSNR 24.09 dB, MAE 1.81. Error accumulates through two stages of INT4-weight + FP16-attention inference, and the signal still holds well above the 20 dB regression floor we ship with.
Where those numbers come from, layer by layer:
| Configuration | Encoder PSNR | Decoder PSNR |
|---|---|---|
| Single attention-layer kernel only | ~58.5 dB | ~70.3 dB |
| 4-layer stack, up to final LayerNorm | ~45 dB | — |
| 4-layer stack, fed float-reference input | — | ~48.8 dB |
| 4-layer stack, full end-to-end ISS pipeline | ~25.9 dB | ~24 dB |
The big drops are honest. End-to-end PSNR is the only number that matters for whether the chip's output is usable, and we ship the demo at that number rather than the per-stage one against a synthetic float input.
The test clip is 5.86 seconds from LibriSpeech: "Mister Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
Whisper generates text auto-regressively, one token at a time, and each token requires a full decoder forward pass. The decoder prefill — the first step in that sequence — is the most demanding: it processes the four bootstrap tokens with cross-attention to all 1,500 encoder positions and produces logits over a 51,865-token vocabulary. If the prefill gets the wrong answer, every subsequent token inherits the error. This is the hardest single inference step in the pipeline and it's the one we verified.
The decoder's top-1 prediction on the position that initiates generation matches the float32 reference, top-2 also agrees, and the remaining tokens in the top-5 set are the same on both sides, reordered within the noise floor. The chip is producing the same logit ranking the float model produces with no surprise drift or quantization tax that changes what the model says.
Word error rate on real audio
The notebook proves the chip is faithful on one sample, but the accuracy proof at scale comes from a separate path: 26.05 also lands end-to-end INT8 post-training quantization for encoder-decoder speech models. Benched on the full LibriSpeech test-clean split, openai/whisper-tiny at FP32 hits 7.54% WER; the INT8 port through the new quantizer hits 8.92% WER, within ±0.2 percentage points of the float reference. The full pipeline holds at scale, on real audio, against the float model.
What this means
Chimera has now run vision (YOLO), non-AI compute (DOOM), audio DSP (ASVspoof), and speech (Whisper). Same silicon. The hard part is never the matrix multiplies, it's the rest of the pipeline.
YOLO's challenge was Non-Maximal Suppression, a classical algorithm that most accelerators ship to the CPU. DOOM's challenge was everything: branching, pointer arithmetic, table lookups, zero matrix multiplies. ASVspoof's challenge was a Hamming-windowed FFT and a mel filterbank inside the same graph as the LSTM that consumes them. Whisper's challenge is the decoder, cross-attention, mixed-precision quantization, and a compute pattern that fixed-function accelerators can't express.
The Chimera architecture didn't change between demos. The compiler got smarter, the custom ops expanded, multicore arrived with a flag, but the silicon is the same general-purpose NPU that ran DOOM at 1,785 frames per second.
Your NPU runs neural networks; Quadric's runs programs, and now it listens, too.
Related Semiconductor IP
- GPNPU Processor IP - 32 to 864TOPs
- GPNPU Processor IP - 16 to 108 TOPs
- GPNPU Processor IP - 4 to 28 TOPs
- NPU
- NPU IP Core for Edge
Related Blogs
- Can You Rely Upon your NPU Vendor to be Your Customers' Data Science Team?
- Arm Ethos-N78 NPU: Unprecedented Machine Learning Capability at your Fingertips
- Why You Can't Trust Your NPU Vendor's Benchmarks
- Why You Should Create Your Own NPU Benchmarks
Latest Blogs
- UWB and 6G: Why Precision Sensing Will Define the Next Wireless Revolution
- Your NPU Learned to Listen
- AI-Accelerated Vulnerability Discovery: An Emergency in Software Security. Is Hardware Next?
- Securing SoC Reliability with Precision Power-On Reset IP for 0.8V Industrial Designs
- How data movement defines performance for AI silicon