How to implement double-precision floating-point on FPGAs
October 03, 2007 -- pldesignline.com
Floating-point arithmetic is used extensively in many applications across multiple market segments. These applications often require a large number of calculations and are prevalent in financial analytics, bioinformatics, molecular dynamics, radar, and seismic imaging, to name a few. Where integer and single-precision 32-bit floating-point math are not sufficient, many applications demand higher precision, forcing the use of double-precision 64-bit operations. This article demonstrates the double-precision floating-point performance of FPGAs using two different approaches. First, a theoretical "paper and pencil" calculation is used to estimate peak performance. This type of calculation may be useful for raw comparison between devices, but it is somewhat unrealistic: it assumes data is always available to feed the device and does not take into account memory interfaces and latencies, place and route constraints, and other aspects of an actual FPGA design. Second, the actual results of a double-precision matrix multiply core that can easily be extended to a full DGEMM benchmark are presented, and the real-world constraints and challenges of achieving such results are discussed in detail.
Introduction
An increasing number of applications in many vertical market segments, from financial analytics to military radar to various imaging applications, are relying on computations with floating-point (FP) numbers. These applications implement various basic functions and methods such as fast Fourier transforms (FFTs), finite impulse response (FIR) filters, synthetic aperture radar (SAR), matrix math, and Monte Carlo. Many of these implementations use single-precision FP, where FPGAs can provide up to ten times the sustained performance compared to traditional CPUs. Recently, there has been increasing interest in double-precision performance to see how well FPGAs can compete with CPUs, especially for designs that have power and cooling constraints.
In a recent article titled "FPGA Floating-Point Performance – A Paper and Pencil Evaluation," Dave Strenski discusses how to estimate the peak double-precision (64-bit) FP performance of an FPGA. This article evaluates his method and, more importantly, expands on it with "real-world" considerations for estimating the sustained FP performance of an FPGA. These considerations are validated using a matrix multiplication design running in an Altera Stratix II FPGA.
The double-precision general matrix multiply (DGEMM) routine is referenced here. DGEMM is a common building block for many algorithms and is the most important component of the scientific LINPACK benchmark commonly used on CPUs. The Basic Linear Algebra Subprograms (BLAS) include DGEMM in the Level 3 group. The DGEMM routine calculates the new value of matrix C based on the product of matrix A and matrix B and the previous value of matrix C using the formula C = αAB + βC (where α and β are scalar coefficients).
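The formula above can be sketched in a few lines of plain Python. This is only an illustrative software reference, not the FPGA design discussed in this article: the function name `dgemm`, its argument order, and the operation counters are assumptions made for the example. The two scalars are applied outside the inner loop, and those scaling multiplications are deliberately left out of the counts.

```python
def dgemm(alpha, a, b, beta, c):
    """Return (alpha*A*B + beta*C, multiply count, add count).

    Matrices are row-major lists of lists. Hypothetical reference
    code for illustration only; it does not model FPGA hardware.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    # Apply the scalars up front so the inner loop is a pure
    # multiply-accumulate; these scaling operations are not counted.
    a_scaled = [[alpha * x for x in row] for row in a]
    result = [[beta * x for x in row] for row in c]
    mults = adds = 0
    for i in range(rows):
        for j in range(cols):
            acc = result[i][j]          # start from the old value of C
            for p in range(inner):
                acc += a_scaled[i][p] * b[p][j]   # one multiply, one add
                mults += 1
                adds += 1
            result[i][j] = acc
    return result, mults, adds


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(dgemm(1.0, a, b, 1.0, c))
```

For an n x n problem the inner loop runs n³ times, so the sketch performs n³ multiplications and n³ additions, or 2n³ floating-point operations in total.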
For this analysis, α = β = 1 is used, although any scalar values could be used because the scaling can be applied while data is transferred in and out. Each term of the matrix product requires one multiplication and one addition, so this operation results in a 1:1 ratio of adders and multipliers. This analysis also takes into account the logic required for a microprocessor interface protocol core and adds the following considerations:
- Memory interface module for low latency access to local data
- Data paths from memory interface to FPGA memory
- Data path from FPGA memory to FP cores
- Decrease to FP core FMAX when the FPGA is full
- Unusable FPGA logic due to routing challenges of a full FPGA
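As a rough illustration of how these considerations reduce a peak figure, the sketch below subtracts interface logic and unroutable logic from the device total and applies a derated clock. It is a deliberately simplified model and not the method used in this article: the function names, the parameters, and every number in the usage example are hypothetical placeholders rather than vendor or measured data, and limits such as the number of dedicated multiplier blocks are ignored.

```python
def peak_gflops(fp_pairs, fmax_mhz):
    """Peak rate if every adder/multiplier pair completes 2 FLOPs per clock."""
    return 2 * fp_pairs * fmax_mhz / 1000.0


def sustained_estimate(total_logic, interface_logic, unusable_percent,
                       logic_per_pair, derated_fmax_mhz):
    """Return (usable adder/multiplier pairs, estimated GFLOPS).

    All inputs are hypothetical placeholders, not device data.
    """
    # Logic that remains placeable once routing congestion on a
    # full device is taken into account.
    routable = total_logic * (100 - unusable_percent) // 100
    # Logic left after the processor interface, memory interface,
    # and data paths have been accounted for.
    available = max(routable - interface_logic, 0)
    pairs = available // logic_per_pair
    # The FP cores run at the reduced clock of a full device.
    return pairs, peak_gflops(pairs, derated_fmax_mhz)


if __name__ == "__main__":
    # Made-up numbers chosen only to exercise the arithmetic.
    print(sustained_estimate(total_logic=100_000, interface_logic=10_000,
                             unusable_percent=20, logic_per_pair=3_500,
                             derated_fmax_mhz=150))
```

Comparing the result with `peak_gflops` evaluated on the full device at its nominal clock gives a feel for how far a sustained estimate can fall below the paper-and-pencil peak under these assumptions.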
The FPGA benchmark focuses on the performance of an implementation of the AB matrix multiplication with data from a locally attached SRAM. Extending this core to include the accumulator that adds the old value of C is a relatively minor effort.