We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research, it appears that FFTs are fairly well suited to computation on a GPU rather than a CPU. For example, this page has some benchmarks with a Core 2 Quad and a GF 8800 GTX that show a 10-fold decrease in calculation time when using the GPU:
http://www.cv.nrao.edu/~pdemores/gpu/
However, in our product, size constraints restrict us to small form factors such as PC104 or Mini-ITX, and thus to rather limited embedded GPUs.
Is offloading computation to the GPU something that is only worth doing with meaty graphics cards on a proper PCIe bus, or would even embedded GPUs offer performance improvements?
Having developed FFT routines both on x86 hardware and GPUs (prior to CUDA, 7800 GTX Hardware) I found from my own results that with smaller sizes of FFT (below 2^13) that the CPU was faster. Above these sizes the GPU was faster. For instance, a 2^16 sized FFT computed an 2-4x more quickly on the GPU than the equivalent transform on the CPU. See a table of times below (All times are in seconds, comparing a 3GHz Pentium 4 vs. 7800GTX. This work was done back in 2005 so old hardware and as I said, non CUDA. Newer libraries may show larger improvements)
N FFTw (s) GPUFFT (s) GPUFFT MFLOPS GPUFFT Speedup 8 0 0.00006 3.352705 0.006881 16 0.000001 0.000065 7.882117 0.010217 32 0.000001 0.000075 17.10887 0.014695 64 0.000002 0.000085 36.080118 0.026744 128 0.000004 0.000093 76.724324 0.040122 256 0.000007 0.000107 153.739856 0.066754 512 0.000015 0.000115 320.200892 0.134614 1024 0.000034 0.000125 657.735381 0.270512 2048 0.000076 0.000156 1155.151507 0.484331 4096 0.000173 0.000215 1834.212989 0.804558 8192 0.000483 0.00032 2664.042421 1.510011 16384 0.001363 0.000605 3035.4551 2.255411 32768 0.003168 0.00114 3450.455808 2.780041 65536 0.008694 0.002464 3404.628083 3.528726 131072 0.015363 0.005027 3545.850483 3.05604 262144 0.033223 0.012513 3016.885246 2.655183 524288 0.072918 0.025879 3079.443664 2.817667 1048576 0.173043 0.076537 2192.056517 2.260904 2097152 0.331553 0.157427 2238.01491 2.106081 4194304 0.801544 0.430518 1715.573229 1.861814
As suggested by other posters the transfer of data to/from the GPU is the hit you take. Smaller FFTs can be performed on the CPU, some implementations/sizes entirely in the cache. This makes the CPU the best choice for small FFTs (below ~1024 points). If on the other hand you need to perform large batches of work on data with minimal moves to/from the GPU then the GPU will beat the CPU hands down.
I would suggest using FFTW if you want a fast FFT implementation, or the Intel Math Library if you want an even faster (commercial) implementation. For FFTW, performing plans using the FFTW_Measure flag will measure and test the fastest possible FFT routine for your specific hardware. I go into detail about this in this question.
For GPU implementations you can't get better than the one provided by NVidia CUDA. The performance of GPUs has increased significantly since I did my experiments on a 7800GTX so I would suggest giving their SDK a go for your specific requirement.