Algorithms FPGAs dominate CPUs on

anon · May 26, 2010 · Viewed 8.4k times

For most of my life I've programmed CPUs, and although for most algorithms the big-Oh running time remains the same on CPUs and FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around, whereas FPGAs are often compute bound).

I would like to learn more about this -- does anyone know of good books / reference papers / tutorials that deal with the questions of:

  • what tasks do FPGAs dominate CPUs on (in terms of pure speed)?
  • what tasks do FPGAs dominate CPUs on (in terms of work per joule)?

Note: marked community wiki

Answer

Beni Cherniavsky-Paskin · May 26, 2010

[no links, just my musings]

FPGAs are essentially interpreters for hardware! The architecture is like that of a dedicated ASIC, but in exchange for rapid development you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.

So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:

  • Massive opportunities for fine-grained parallelism.
    (Doing 4 operations at once doesn't count; 128 does.)
  • Opportunity for deep pipelining.
    This is also a kind of parallelism, but it's hard to apply it to a single task, so it helps if you can get many separate tasks to work on in parallel.
  • (Mostly) Fixed data flow paths.
    Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
  • High total bandwidth to many small memories.
    FPGAs have hundreds of small (O(1KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
  • Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work. This way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes) - but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.
  • Simple enough for HW (aka good SW/HW partitioning).
    Many tasks consist of 90% irregular glue logic and only 10% hard work ("kernel" in the DSP sense). If you put all that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and fully utilize the HW for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
  • Weird bit manipulations are a plus.
    Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you - most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.
    Lots of floating point specifically is a minus, because both CPUs and GPUs crunch it on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
  • Low latency / real-time requirements are a plus.
    Hardware can really shine under such demands.
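To make the "weird bit manipulations" point concrete, here's a minimal C sketch (my own illustration, not from any particular library) of extracting an unaligned bit field from a packed buffer. On a CPU this costs loads, shifts, and masks per field; on an FPGA the same extraction is literally just wiring.

```c
#include <stdint.h>
#include <stddef.h>

/* Extract a 'width'-bit unsigned field (width <= 32) starting at
 * arbitrary bit offset 'bitpos' in a packed little-endian byte buffer. */
static uint32_t get_bits(const uint8_t *buf, size_t bitpos, unsigned width) {
    uint64_t word = 0;
    size_t byte = bitpos / 8;
    unsigned shift = bitpos % 8;
    unsigned nbytes = (shift + width + 7) / 8;   /* bytes covering the field */
    for (unsigned i = 0; i < nbytes; i++)        /* gather the covering bytes */
        word |= (uint64_t)buf[byte + i] << (8 * i);
    return (uint32_t)((word >> shift) & ((1ULL << width) - 1));
}
```

Every call is a handful of instructions per field on a CPU; in hardware the shift and mask vanish into routing, which is exactly the kind of constant-factor win the answer describes.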

EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
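As a hedged sketch of what bit slicing means here: store bit i of 64 independent 1-bit instances in bit i of a 64-bit word, and each bitwise operation then advances all 64 instances at once - a CPU emulating fine-grained hardware parallelism. A bit-sliced 1-bit full adder, for example (my own illustration):

```c
#include <stdint.h>

/* Bit-sliced full adder: each bit position of the 64-bit words holds one
 * independent 1-bit instance, so one call adds 64 instances in parallel. */
typedef struct { uint64_t sum, carry; } sliced_add_t;

static sliced_add_t sliced_full_add(uint64_t a, uint64_t b, uint64_t cin) {
    sliced_add_t r;
    r.sum   = a ^ b ^ cin;                /* XOR of the three inputs   */
    r.carry = (a & b) | (cin & (a ^ b));  /* majority function = carry */
    return r;
}
```

This only pays off under the same conditions listed above - fixed data flow and many separate tasks - which is why it somewhat levels the field rather than closing the gap.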