Implementing floating-point DSP on FPGAs
For developers using FPGAs for the implementation of floating-point DSP functions, one key challenge is how to decompose the computation algorithm into sequences of parallel hardware processes while efficiently managing data flow through the parallel pipelines of these processes.
In this article, we'll discuss our experiences exploring architectures with Xilinx PicoBlaze controllers, and present a design strategy employing the ESL techniques of model-based and C-based design to demonstrate how you can rapidly integrate highly parameterizable DSP hardware primitives into power-efficient high-performance implementations in Spartan devices.
Hardware Acceleration and Reuse
High-performance implementations of floating-point DSP algorithms in FPGAs require single-cycle parallel memory accesses and effective use of pipelined arithmetic operators. Many common DSP vector and matrix operations can be split into batch calculations fulfilling these requirements. Our architectures comprise of Xilinx PicoBlaze worker processors, each with a dedicated DSP hardware accelerator (Figure 1). Each worker can do preparatory tasks for the next batch in parallel with its hardware accelerator. Once the DSP hardware accelerator finishes the computation, it issues an interrupt to the worker. The worker's job is to combine the accelerated parts of the computation into a complete DSP algorithm.

Figure 1. PicoBlaze-based architecture for floating-point DSP. The DSP hardware accelerators are modeled and implemented using Celoxica DK.
It is ideal if you limit implementations to the batch operations of each worker starting in a block RAM, performing a relatively simple sequence of pipelined operations at the maximum clock speed and returning the result(s) back to another block RAM. You can effectively map these primitives to hardware, including the complete autonomous data-flow control in hardware. You can also code the related dedicated generators of address counters and control signals in Handel-C, using several synchronized do-while loops. Simulink is effective for fast derivation of bit-exact models of the batch calculations in DSP hardware accelerators.
To read the full article, click here
Related Semiconductor IP
- Sine Wave Frequency Generator
- CAN XL Verification IP
- Rad-Hard GPIO, ODIO & LVDS in SkyWater 90nm
- 1.22V/1uA Reference voltage and current source
- 1.2V SLVS Transceiver in UMC 110nm
Related White Papers
- Implementing LTE on FPGAs
- Using an interface wrapper module to simplify implementing PCIe on FPGAs
- How to implement double-precision floating-point on FPGAs
- Implementing floating-point algorithms in FPGAs or ASICs
Latest White Papers
- OmniSim: Simulating Hardware with C Speed and RTL Accuracy for High-Level Synthesis Designs
- Balancing Power and Performance With Task Dependencies in Multi-Core Systems
- LLM Inference with Codebook-based Q4X Quantization using the Llama.cpp Framework on RISC-V Vector CPUs
- PCIe 5.0: The universal high-speed interconnect for High Bandwidth and Low Latency Applications Design Challenges & Solutions
- Basilisk: A 34 mm2 End-to-End Open-Source 64-bit Linux-Capable RISC-V SoC in 130nm BiCMOS