BLAS is short for "Basic Linear Algebra Subprograms", a well-known collection of routines for linear algebra. I just learned from Oleksandr R. that Mathematica can call BLAS directly through the context "LinearAlgebra`BLAS`". I don't know why Mathematica leaves it undocumented.
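Since the context is undocumented, one way to explore it is to list the symbols it contains. A minimal sketch (assuming the symbols are already loaded in a fresh kernel; if the list comes back empty, that assumption fails and evaluating one of the functions once should trigger the autoloader):

Names["LinearAlgebra`BLAS`*"]   (* should include "LinearAlgebra`BLAS`GER" among others *)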
Oleksandr R. provided an example of using GER
$$ \mathrm{GER}: \alpha, \vec{x}, \vec{y}, \mathbf{A} : \mathbf{A} \leftarrow \alpha \vec{x} {\vec{y}}^\mathrm{T} + \mathbf{A} $$
like this:
A = RandomReal[1., {40000000, 2}];   (* 40000000 x 2 packed array *)
alpha = 1.;
x = ConstantArray[1., Length[A]];    (* vector of ones, one entry per row of A *)
y = {1., 2.};
LinearAlgebra`BLAS`GER[alpha, x, y, A]; // AbsoluteTiming   (* updates A in place *)
to achieve the same result as
Transpose[{1., 2.} + Transpose@A]; // AbsoluteTiming
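To convince myself that the two forms really agree, here is a small sanity check of my own (the tiny test sizes are mine, not from the original post; note that GER overwrites its matrix argument in place):

Atest = RandomReal[1., {5, 2}];                     (* small stand-in for A *)
expected = Transpose[{1., 2.} + Transpose@Atest];   (* the double-Transpose result *)
LinearAlgebra`BLAS`GER[1., ConstantArray[1., 5], {1., 2.}, Atest];  (* overwrites Atest *)
Max@Abs[Atest - expected]                           (* 0., up to rounding *)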
Oleksandr R.'s timing shows that the BLAS approach is faster. Strangely, on my computer (Mathematica 10.3 on Windows), I tried many times and the BLAS call is consistently much slower than the double-Transpose version.
I also tried it on an HPC cluster with the Linux version of Mathematica 10.3 installed.
At first it seemed the problem might be specific to the Windows version of 10.3, but after trying it on my friend's computer I know it is not. Here is the timing on his computer, also a Windows system:
So what is wrong, and how can this be explained? My CPU is an Intel Core i3-4500U on a 64-bit Windows 8 system with 8 GB of RAM, of which 1.2 GB is set aside for a ramdisk.
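For reference, this is how I would redo the measurement following ilian's advice in the comments below: set $HistoryLength = 0, wrap the whole expression in AbsoluteTiming, and never benchmark GER on a matrix it has already overwritten. This is only a sketch of the setup, not new results:

$HistoryLength = 0;   (* don't keep Out[] references to the large arrays *)
A = RandomReal[1., {40000000, 2}];
alpha = 1.;
x = ConstantArray[1., Length[A]];
y = {1., 2.};
First@AbsoluteTiming[Transpose[y + Transpose@A];]              (* leaves A untouched *)
First@AbsoluteTiming[LinearAlgebra`BLAS`GER[alpha, x, y, A];]  (* overwrites A; regenerate it before timing again *)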
Comments:

Try (LinearAlgebra`BLAS`GER[alpha, x, y, A]; // AbsoluteTiming) to be sure you are not timing something else. Also keep in mind, as @OleksandrR mentioned, that it is not a fair comparison if you are overwriting A. – ilian Nov 21 '15 at 17:00

In[1] definitely includes generating A, which is not correct. As for the Windows difference, I haven't been able to reproduce it. Could you perhaps include the exact CPU models and the amount of RAM for both machines? I'd also suggest setting $HistoryLength = 0 first. – ilian Nov 22 '15 at 01:12