How to transpose a matrix in CUDA/cuBLAS?

Say I have a matrix of dimension A*B on the GPU, stored C style (row-major), so that B (the number of columns) is the leading dimension. Is there any method in CUDA (or cuBLAS) to transpose this matrix to FORTRAN style (column-major), where A (the number of rows) becomes the leading dimension?

It would be even better if it could be transposed during the host->device transfer while keeping the original data unchanged.


As asked in the title, a row-major matrix A[m][n] in device memory can be transposed with cublasSgeam. Since geam does not support transposing in place, the input is read from a separate copy of A:

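    float* clone = ...;//copy content of A to clone
    float const alpha(1.0);
    float const beta(0.0);
    cublasHandle_t handle;
    cublasCreate( &handle ); // the handle must be initialized before any cuBLAS call
    // A = alpha * clone^T + beta * clone, written as an m x n column-major matrix (ldc = m)
    cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, clone, n, &beta, clone, m, A, m );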
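
To see why those arguments give a transpose, here is a minimal CPU-only sketch in plain C++ (no GPU required). It mimics what cublasSgeam computes under column-major conventions for the case transa = CUBLAS_OP_T and beta = 0. The names `geam_t_ref` and `transpose_row_major` are hypothetical helpers written for illustration; they are not part of cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for cublasSgeam with transa = CUBLAS_OP_T and beta = 0:
// C (rows x cols, column-major, leading dimension ldc) = alpha * transpose(A),
// where A is stored as cols x rows, column-major, with leading dimension lda.
void geam_t_ref(int rows, int cols, float alpha,
                const float* A, int lda, float* C, int ldc) {
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            C[i + j * ldc] = alpha * A[j + i * lda];
}

// Transpose a row-major m x n matrix with the same argument order as the
// cuBLAS call above: (m, n, clone, lda = n, output, ldc = m).
std::vector<float> transpose_row_major(const std::vector<float>& a, int m, int n) {
    assert(a.size() == static_cast<std::size_t>(m) * n);
    std::vector<float> out(a.size());
    geam_t_ref(m, n, 1.0f, a.data(), n, out.data(), m);
    return out;
}
```

The output buffer can be read either as the row-major transpose (n rows of m elements) or as the original matrix in column-major order with leading dimension m, which is the FORTRAN-style layout the question asks for.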

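To multiply two row-major matrices A[m][k] and B[k][n], C = A*B, swap the operands: a row-major matrix looks like its transpose to cuBLAS, so the call below computes C^T = B^T * A^T in column-major terms: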

    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n );

where C is also a row-major matrix.
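
The operand swap can be checked on the CPU with a small sketch in plain C++. It assumes a naive column-major reference for the CUBLAS_OP_N / CUBLAS_OP_N case; `gemm_nn_ref` and `matmul_row_major` are hypothetical names used only for illustration and are not cuBLAS functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for cublasSgemm with both ops = CUBLAS_OP_N:
// C (rows x cols) = alpha * A (rows x inner) * B (inner x cols) + beta * C,
// all column-major with leading dimensions lda, ldb, ldc.
void gemm_nn_ref(int rows, int cols, int inner, float alpha,
                 const float* A, int lda, const float* B, int ldb,
                 float beta, float* C, int ldc) {
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < inner; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            // When beta is zero, the old contents of C are not read.
            C[i + j * ldc] = (beta == 0.0f) ? alpha * acc
                                            : alpha * acc + beta * C[i + j * ldc];
        }
}

// Row-major C[m][n] = A[m][k] * B[k][n], using the same swapped argument
// order as the cuBLAS call above: (n, m, k, B, n, A, k, C, n).
std::vector<float> matmul_row_major(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    int m, int k, int n) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    std::vector<float> c(static_cast<std::size_t>(m) * n);
    gemm_nn_ref(n, m, k, 1.0f, b.data(), n, a.data(), k, 0.0f, c.data(), n);
    return c;
}
```

For example, multiplying the 2x3 matrix {1,2,3,4,5,6} by the 3x2 matrix {7,8,9,10,11,12} this way fills c with the row-major product, without any explicit transpose of the inputs or the result.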