I'm looking for a very bare-bones matrix multiplication example for CUBLAS that can multiply M by N and place the result in P for the following code, using high-performance GPU operations:
const int Width = 500;
float M[500][500], N[500][500], P[500][500];
for (int i = 0; i < Width; i++) {
    for (int j = 0; j < Width; j++) {
        M[i][j] = 500;
        N[i][j] = 500;
        P[i][j] = 0;
    }
}
So far, most of the code I've found for doing any kind of matrix multiplication with CUBLAS seems overly complicated.
I am trying to design a basic lab where students can compare the performance of matrix multiplication on the GPU versus the CPU, presumably with better performance on the GPU.
The SDK contains matrixMul, which illustrates the use of CUBLAS. For a simpler example, see section 1.3 of the CUBLAS manual.
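Along the lines of that manual section, here is a minimal sketch of what the call could look like for your 500x500 case, assuming the cublas_v2 API and omitting all error checking. Passing N before M to cublasSgemm is my shortcut for reconciling cuBLAS's column-major layout with your row-major C arrays, so double-check it against the manual:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define WIDTH 500

int main(void)
{
    /* static keeps the ~1 MB-per-matrix arrays off the stack */
    static float M[WIDTH][WIDTH], N[WIDTH][WIDTH], P[WIDTH][WIDTH];
    size_t bytes = (size_t)WIDTH * WIDTH * sizeof(float);

    for (int i = 0; i < WIDTH; i++)
        for (int j = 0; j < WIDTH; j++) {
            M[i][j] = 500; N[i][j] = 500; P[i][j] = 0;
        }

    /* allocate device copies and upload the inputs */
    float *dM, *dN, *dP;
    cudaMalloc((void **)&dM, bytes);
    cudaMalloc((void **)&dN, bytes);
    cudaMalloc((void **)&dP, bytes);
    cudaMemcpy(dM, M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* P = 1.0 * M * N + 0.0 * P.  cuBLAS is column-major, so the row-major
       arrays look transposed to it; computing N*M in column-major terms
       yields M*N in row-major terms without any explicit transposes. */
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                WIDTH, WIDTH, WIDTH,
                &alpha, dN, WIDTH, dM, WIDTH,
                &beta,  dP, WIDTH);

    /* copy the result back and spot-check one element (should be 500*500*500) */
    cudaMemcpy(P, dP, bytes, cudaMemcpyDeviceToHost);
    printf("P[0][0] = %f\n", P[0][0]);

    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    return 0;
}

Compile and link with something like nvcc lab.cu -lcublas.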
The matrixMul sample also shows a custom kernel; this won't perform as well as CUBLAS, of course.
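If you want students to see the custom-kernel side of the comparison as well, a naive one-thread-per-output-element kernel along these lines (hypothetical names, no shared-memory tiling, so it should land somewhere between the CPU loop and CUBLAS) keeps the lab readable:

/* one thread computes one element of P; no tiling, so it won't match CUBLAS */
__global__ void matMulNaive(const float *M, const float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}

/* launched with something like:
   dim3 block(16, 16);
   dim3 grid((500 + 15) / 16, (500 + 15) / 16);
   matMulNaive<<<grid, block>>>(dM, dN, dP, 500);
*/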