The code below performs the operation the same operation on gpuArrays
b in two different ways. The first part computes
(a'*(a*b)')' , while the second part computes
a*b*a. The results are then verified to be the same.
%function test clear rng('default');rng(1); a=sprand(3000,3000,0.1); b=rand(3000,3000); a=gpuArray(a); b=gpuArray(b); tic; c1=gather(transpose(transpose(a)*transpose(a*b))); disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s']) clearvars -except c1 rng('default'); rng(1) a=sprand(3000,3000,0.1); b=rand(3000,3000); a=gpuArray(a); b=gpuArray(b); tic; c2=gather(a*b*a); disp(['time for a*b*a: ' , num2str(toc),'s']) disp(['error = ',num2str(max(max(abs(c1-c2))))]) %end
(a'*(a*b)')' is roughly 4 times faster than computing
a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with the similar behaviour).
>> test time for (a'*(a*b)')': 0.43234s time for a*b*a: 1.7175s error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test time for (a'*(a*b)')': 1.717s time for a*b*a: 1.7153s error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform
(a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.
EDIT: They use MAGMA, which is column major. My answer does not hold, however I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess, I can not 100% tell you without knowing the code under MATLAB's hood.
Hypothesis: MATLABs parallel computing code uses CUDA libraries, not their own.
- MATLAB is column major and CUDA is row major.
- There is no such things as 2D matrices, only 1D matrices with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
- Write your own column-major algebra libraries in CUDA
- Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for MATLAB parallel processing toolbox (reasonable assumption), for the second case, it takes
b, transposes them, does the maths, and transposes the output when you
In the first case however, you already do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling
gather(transpose( XX )) it just skips the output transposition is side. The same with
transpose(a*b). Note that
transpose(a*b)=transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this
transpose stuff inside and just does all the operations anyway, losing the speed up.
Interesting observation: It takes the same time in CPU than GPU to do
a*b*a in my PC.