79

This question is an extension of two discussions that came up recently in the replies to "C++ vs Fortran for HPC". And it is a bit more of a challenge than a question...

One of the most often-heard arguments in favor of Fortran is that the compilers are just better. Since most C/Fortran compilers share the same back end, the code generated for semantically equivalent programs in both languages should be identical. One could argue, however, that one language is easier for the compiler to optimize than the other.

So I decided to try a simple test: I got a copy of daxpy.f and daxpy.c and compiled them with gfortran/gcc.

Now daxpy.c is just an f2c translation of daxpy.f (automatically generated code, ugly as heck), so I took that code and cleaned it up a bit (meet daxpy_c), which basically meant re-writing the innermost loop as

for ( i = 0 ; i < n ; i++ )
    dy[i] += da * dx[i];

Finally, I re-wrote it (enter daxpy_cvec) using gcc's vector syntax:

#define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
vector(2,double) va = { da , da }, *vx, *vy;

vx = (void *)dx; vy = (void *)dy;
for ( i = 0 ; i < (n/2 & ~1) ; i += 2 ) {
    vy[i] += va * vx[i];
    vy[i+1] += va * vx[i+1];
    }
for ( i = n & ~3 ; i < n ; i++ )
    dy[i] += da * dx[i];

Note that I use vectors of length 2 (that's all SSE2 allows) and that I process two vectors at a time. This is because on many architectures, we may have more multiplication units than we have vector elements.

All codes were compiled using gfortran/gcc version 4.5 with the flags "-O3 -Wall -msse2 -march=native -ffast-math -fomit-frame-pointer -malign-double -fstrict-aliasing". On my laptop (Intel Core i5 CPU, M560, 2.67GHz) I got the following output:

pedro@laika:~/work/fvsc$ ./test 1000000 10000
timing 1000000 runs with a vector of length 10000.
daxpy_f took 8156.7 ms.
daxpy_f2c took 10568.1 ms.
daxpy_c took 7912.8 ms.
daxpy_cvec took 5670.8 ms.

So the original Fortran code takes a bit more than 8.1 seconds, the automatic translation thereof takes 10.5 seconds, the naive C implementation does it in 7.9, and the explicitly vectorized code does it in 5.6 seconds.

That's Fortran being slightly slower than the naive C implementation and roughly 45% slower than the vectorized C implementation.

So here's the question: I'm a native C programmer and so I'm quite confident that I did a good job on that code, but the Fortran code was last touched in 1993 and might therefore be a bit out of date. Since I don't feel as comfortable coding in Fortran as others here might, can anyone do a better job, i.e. make it more competitive with either of the two C versions?

Also, can anybody try this test with icc/ifort? The vector syntax probably won't work, but I would be curious to see how the naive C version behaves there. Same goes for anybody with xlc/xlf lying around.

I've uploaded the sources and a Makefile here. To get accurate timings, set CPU_TPS in test.c to the number of Hz on your CPU. If you find any improvements to any of the versions, please do post them here!

Update:

I've added stali's test code to the files online and supplemented it with a C version. I modified the programs to do 1'000'000 loops on vectors of length 10'000 to be consistent with the previous test (and because my machine couldn't allocate vectors of length 1'000'000'000, as in stali's original code). Since the numbers are now a bit smaller, I used the option -par-threshold:50 to make the compiler more likely to parallelize. The icc/ifort version used is 12.1.2 20111128 and the results are as follows

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./icctest_c
3.27user 0.00system 0:03.27elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./icctest_f
3.29user 0.00system 0:03.29elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./icctest_c
4.89user 0.00system 0:02.60elapsed 188%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./icctest_f
4.91user 0.00system 0:02.60elapsed 188%CPU

In summary, the results are, for all practical purposes, identical for both the C and Fortran versions, and both codes parallelize automagically. Note that the fast times compared to the previous test are due to the use of single-precision floating point arithmetic!

Update:

Although I don't really like where the burden of proof is going here, I've re-coded stali's matrix multiplication example in C and added it to the files on the web. Here are the results of the triple loop for one and two CPUs:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 2500
 triple do time   3.46421700000000     
3.63user 0.06system 0:03.70elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_c 2500
triple do time 3.431997791385768
3.58user 0.10system 0:03.69elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 2500
 triple do time   5.09631900000000     
5.26user 0.06system 0:02.81elapsed 189%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_c 2500
triple do time 2.298916975280899
4.78user 0.08system 0:02.62elapsed 184%CPU

Note that cpu_time in Fortran measures the CPU time and not the wall-clock time, so I wrapped the calls in time to compare them for two CPUs. There is no real difference between the results, except that the C version does a bit better on two cores.

Now for the matmul command, which of course exists only in Fortran, as this intrinsic is not available in C:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 2500
 matmul    time   23.6494780000000     
23.80user 0.08system 0:23.91elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 2500
 matmul    time   26.6176640000000     
26.75user 0.10system 0:13.62elapsed 197%CPU

Wow. That's absolutely terrible. Can anyone either find out what I'm doing wrong, or explain why this intrinsic is still somehow a good thing?

I didn't add the dgemm calls to the benchmark as they are library calls to the same function in the Intel MKL.

For future tests, can anyone suggest an example known to be slower in C than in Fortran?

Update

To verify stali's claim that the matmul intrinsic is "an order of magnitude" faster than the explicit matrix product on smaller matrices, I modified his own code to multiply matrices of size 100x100 using both methods, 10'000 times each. The results, on one and two CPUs, are as follows:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 10000 100
 matmul    time   3.61222500000000     
 triple do time   3.54022200000000     
7.15user 0.00system 0:07.16elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 10000 100
 matmul    time   4.54428400000000     
 triple do time   4.31626900000000     
8.86user 0.00system 0:04.60elapsed 192%CPU

Update

Grisu is correct in pointing out that, without optimizations, gcc converts operations on complex numbers to library function calls while gfortran inlines them in a few instructions.

The C compiler will generate the same compact code if the option -fcx-limited-range is set, i.e. the compiler is instructed to ignore potential over/underflows in the intermediate values. This behaviour is, in effect, the default in gfortran and may lead to incorrect results. Forcing -fno-cx-limited-range in gfortran didn't change anything.

So this is actually an argument against using gfortran for numerical calculations: operations on complex values may over/underflow even though the correct results are within the floating-point range. This behaviour is sanctioned by the Fortran standard. In gcc, or in C99 in general, the default is to do things strictly (read: IEEE-754 compliant) unless otherwise specified.

Reminder: Please keep in mind that the main question was whether Fortran compilers produce better code than C compilers. This is not the place for discussions of the general merits of one language over another. What I would really be interested in is whether anybody can find a way of coaxing gfortran to produce a daxpy as efficient as the one in C using explicit vectorization, as this exemplifies the problems of having to rely exclusively on the compiler for SIMD optimization, or a case in which a Fortran compiler out-does its C counterpart.

Pedro
  • One timing issue is that if your processor does frequency stepping/turbo mode, these results could be all over the map. – Bill Barth Feb 04 '12 at 21:37
  • That is correct, so I also ran the tests backwards (e.g. in the opposite order) and got the same results. I always set the CPU frequency scaling to "Performance" when I do benchmarks. – Pedro Feb 04 '12 at 21:54
  • 1
    Your daxpy_c.c is currently updating x with a multiple of x and not touching y at all. You may want to fix that to make it fair... – Jack Poulson Feb 04 '12 at 22:28
  • @Reid.Atcheson: I actually think that it is a decent first test. It would make sense to understand a single loop routine first, then move onto BLAS 2, and then BLAS 3. – Jack Poulson Feb 04 '12 at 22:36
  • 1
    @JackPoulson: Good catch, fixed and updated the results. – Pedro Feb 04 '12 at 22:45
  • 2
    Also, I am fairly certain that the difference is completely due to the manual unrolling in the Fortran version confusing the compiler. When I replace it with the same simple loop that you put into your C version, the performance between the two is almost identical. Without the change, the Fortran version was slower with the Intel compilers. – Jack Poulson Feb 04 '12 at 22:45
  • @Reid.Atcheson: Actually, there's not much optimization in my naive C routine either... Notice that the original code loops over four elements at a time, probably trying to coax the compiler into vectorizing it. This may actually do more harm than good since, by rounding N down to a multiple of four, it can break the alignment of the data. – Pedro Feb 04 '12 at 22:47
  • Whoops, that's what I get for trying to read Fortran code. My fault. Rewrote history and deleted comment, didn't add anything.

    The only thing I can think of is that jumping over 4 elements confuses the compiler into not doing SSE? SSE registers can only hold two doubles at a time (I can't remember if Intel has rolled out the 256-bit wide registers yet). I don't think this is likely, though.

    Otherwise this is interesting. No idea why there's such a discrepancy.

    – Reid.Atcheson Feb 04 '12 at 22:57
  • @JackPoulson: Good point, can you post that as an answer? It's interesting that the manual unrolling, which was probably meant to help, is actually a problem for the compiler... – Pedro Feb 04 '12 at 23:00
  • Has there ever been a case where the difference in overall performance between Fortran and C had a notable influence on the choice of programming language? Any experiences? – shuhalo Feb 05 '12 at 14:26
  • @Martin: Not that I know of, since most larger projects are done only in one language or the other, hence the lack of actual comparisons. What there is, though, is a huge amount of speculation on the subject, which is why I posted this question :) – Pedro Feb 05 '12 at 14:54
  • ifort's matmul intrinsic is optimized for smaller matrices. If you do matmul 10000 times on smaller matrices (let's say of size 100) then you will find that matmul is an order of magnitude faster than triple do loops. It is up to the compiler developer how he optimizes matmul; ideally, for large matrices, matmul should internally switch to gemm, but it is all up to the compiler writer, as the standard doesn't say anything on how matmul should be implemented. – stali Feb 06 '12 at 13:34
  • @stali: I just checked using your own code, and that's just not true. I've updated the results accordingly. For future reference, please verify your claims before posting them here. The point of this question is not to fight over which language is best, but to verify if there is indeed a significant difference between the compilers. – Pedro Feb 06 '12 at 13:54
  • I have updated your post with my results. – stali Feb 06 '12 at 14:17
  • @stali: Please post your results to your answer. – Pedro Feb 06 '12 at 14:49
  • I didn't save the results and I don't have time to do it again. Next time don't delete my edits, or at the least ask me first; that way I can copy them and at least paste them in a different post. There is a good reason why posts are allowed to be edited by others. You were accusing me of not verifying claims, and when I do and post results, you take them down. If you want to show others that your favorite language is faster, that is fine with me, but for any meaningful discussion you should not just delete what others post, at least not without asking. – stali Feb 06 '12 at 15:17
  • @stali: There seems to be some misunderstanding. As you can verify in the edit logs, I did not remove your results. I was asked to accept or decline your edit to my question, and I declined because I think the best, i.e. less confusing, place to put it is in your answer. I'm sorry for the lost results, but this is not a discussion on language merits, but on compiler performance and your results do not show that ifort produces better code than icc. At best, they show that ifort is better/worse than ifort. – Pedro Feb 06 '12 at 15:36
  • OK I see. For some reason it showed me the edits and then they were gone. – stali Feb 06 '12 at 17:55
  • So what is the conclusion to your answer? The actual Fortran and C codes on this very simple example are as fast. Is that the answer? – Ondřej Čertík Feb 25 '12 at 19:18
  • @OndřejČertík, yes, I guess that's the point... For the examples shown here, there is no difference. – Pedro Feb 26 '12 at 19:23
  • Fortran and C have slightly different semantics regarding arrays (as I remember, Fortran does not allow arrays to overlap). This opens the road for some classes of optimizations that are impossible in C, where arrays passed as arguments can overlap. Of course, not every compiler is able to benefit from such a difference. – permeakra Dec 11 '12 at 07:55
  • 1
    @permeakra: Actually, the C99 standard specifies the restrict keyword which tells the compiler exactly that: to assume that an array does not overlap with any other data structure. – Pedro Dec 11 '12 at 09:45
  • Aren't the example codes tested way too trivial to tell anyone anything about the true speed comparisons of the two languages, or which language optimizes better? –  Dec 11 '12 at 02:54
  • Pedro, can you please upload your source files to GitHub or some other more a permanent location? The following link no longer works: http://people.maths.ox.ac.uk/gonnet/fvsc/ – user1211719 Dec 29 '20 at 11:00

5 Answers

38

The difference in your timings seems to be due to the manual unrolling of the unit-stride Fortran daxpy. The following timings are on a 2.67 GHz Xeon X5650, using the command

./test 1000000 10000

Intel 11.1 compilers

Fortran with manual unrolling: 8.7 sec
Fortran w/o manual unrolling: 5.8 sec
C w/o manual unrolling: 5.8 sec

GNU 4.1.2 compilers

Fortran with manual unrolling: 8.3 sec
Fortran w/o manual unrolling: 13.5 sec
C w/o manual unrolling: 13.6 sec
C with vector attributes: 5.8 sec

GNU 4.4.5 compilers

Fortran with manual unrolling: 8.1 sec
Fortran w/o manual unrolling: 7.4 sec
C w/o manual unrolling: 8.5 sec
C with vector attributes: 5.8 sec

Conclusions

  • Manual unrolling helped the GNU 4.1.2 Fortran compiler on this architecture, but hurt the newer version (4.4.5) and the Intel Fortran compiler.
  • The GNU 4.4.5 C compiler is much more competitive with Fortran than version 4.1.2 was.
  • Vector intrinsics allow GCC's performance to match the Intel compilers.

Time to test more complicated routines like dgemv and dgemm?

Jack Poulson
18

I'm coming late to this party, so it's hard for me to follow all of the back-and-forth above. The question is big, and I think it could be broken up into smaller pieces if you are interested. One thing I got interested in was simply the performance of your daxpy variants, and whether Fortran is slower than C on this very simple code.

Running both on my laptop (MacBook Pro, Intel Core i7, 2.66 GHz), the relative performance of your hand-vectorized C version and the non-hand-vectorized Fortran version depends on the compiler used (with your own options):

Compiler     Fortran time     C time
GCC 4.6.1    5408.5 ms        5424.0 ms
GCC 4.5.3    7889.2 ms        5532.3 ms
GCC 4.4.6    7735.2 ms        5468.7 ms

So, it just seems that GCC got better at vectorizing the loop in the 4.6 branch than it was before.


On the overall debate, I think one can pretty much write fast and optimized code in both C and Fortran, almost as in assembly language. I will point out, however, one thing: just as assembler is more tedious to write than C but gives you finer control over what is executed by the CPU, C is more low-level than Fortran. Thus, it gives you more control over details, which can help with optimization where Fortran's standard syntax (or its vendor extensions) may lack functionality. One case is the explicit use of vector types; another is the possibility of specifying the alignment of variables by hand, something Fortran is incapable of.

F'x
  • welcome to scicomp! I agree that compiler versions are as important as language in this case. Did you mean 'of' instead of 'off' in your last sentence? – Aron Ahmadia Feb 28 '12 at 13:10
9

The way I would write AXPY in Fortran is slightly different. It is the exact translation of the math.

m_blas.f90

 module blas

   interface axpy
     module procedure saxpy,daxpy
   end interface

 contains

   subroutine daxpy(x,y,a)
     implicit none
     real(8) :: x(:),y(:),a
     y=a*x+y
   end subroutine daxpy

   subroutine saxpy(x,y,a)
     implicit none
     real(4) :: x(:),y(:),a
     y=a*x+y
   end subroutine saxpy

 end module blas

Now let's call the above routine in a program.

test.f90

 program main

   use blas
   implicit none

   real(4), allocatable :: x(:),y(:)
   real(4) :: a
   integer :: n

   n=1000000000
   allocate(x(n),y(n))
   x=1.0
   y=2.0
   a=5.0
   call axpy(x,y,a)
   deallocate(x,y)

 end program main

Now let's compile and run it ...

login1$ ifort -fast -parallel m_blas.f90 test.f90
ipo: remark #11000: performing multi-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_iforttdqZSA.o

login1$ export OMP_NUM_THREADS=1
login1$ time ./a.out 
real    0m4.697s
user    0m1.972s
sys     0m2.548s

login1$ export OMP_NUM_THREADS=2
login1$ time ./a.out 
real    0m2.657s
user    0m2.060s
sys     0m2.744s

Notice that I am not using any loops or any explicit OpenMP directives. Would this be possible in C (that is, with no explicit loops and with auto-parallelization)? I don't use C, so I don't know.

stali
  • Automatic parallelisation is a feature of the Intel compilers (both Fortran and C), and not of the language. Hence the equivalent in C should also parallelize. Just out of curiosity, how does it perform for a more moderate n=10000? – Pedro Feb 04 '12 at 22:54
  • 4
    That was the whole point. Autopar is easier in Fortran due to that fact that Fortran (unlike C) supports whole array operations like matmult, transpose etc. So code optimization is easier for Fortran compilers. GFortran (which you have used) doesnt have the developer resources to optimize the Fortran compiler as their focus is currently to implement Fortran 2003 standard rather than optimization. – stali Feb 04 '12 at 22:59
  • Uhmm... The Intel C/C++ compiler icc also does automatic parallelization. I've added a file icctest.c to the other sources. Can you compile it with the same options as you used above, run it, and report the timings? I had to add a printf-statement to my code to avoid gcc optimizing everything out. This is just a quick hack and I hope it is bug-free! – Pedro Feb 04 '12 at 23:17
  • I've downloaded the latest icc/ifort compilers and done the tests myself. The question has been updated to include these new results, i.e. that Intel's autovectorization works in both Fortran and C. – Pedro Feb 05 '12 at 14:56
  • 1
    Thanks. Yes I noticed that there is little difference perhaps because loops are simple and operations are Level 1 BLAS. But as I said before due to Fortran's ability to perform whole array operations and use of keywords such as PURE/ELEMENTAL there is more room for compiler optimization. How the compilers uses this information and what it really does is a different thing. You can also try matmul if you want http://bpaste.net/show/23035/ – stali Feb 05 '12 at 15:29
6

I think it is interesting not only how a compiler optimizes code for modern hardware, but also how the basic operations themselves are translated. Especially between GNU C and GNU Fortran, the code generation can be very different.

So let's consider another example to show the differences between them.

Using complex numbers, the GNU C compiler produces a large overhead for nearly every basic arithmetic operation on a complex number. The Fortran compiler gives much better code. Let's have a look at the following small example in Fortran:

COMPLEX*16 A,B,C
C=A*B

gives (gfortran -g -o complex.f.o -c complex.f95; objdump -d -S complex.f.o):

C=A*B
  52:   dd 45 e0                fldl   -0x20(%ebp)
  55:   dd 45 e8                fldl   -0x18(%ebp)
  58:   dd 45 d0                fldl   -0x30(%ebp)
  5b:   dd 45 d8                fldl   -0x28(%ebp)
  5e:   d9 c3                   fld    %st(3)
  60:   d8 ca                   fmul   %st(2),%st
  62:   d9 c3                   fld    %st(3)
  64:   d8 ca                   fmul   %st(2),%st
  66:   d9 ca                   fxch   %st(2)
  68:   de cd                   fmulp  %st,%st(5)
  6a:   d9 ca                   fxch   %st(2)
  6c:   de cb                   fmulp  %st,%st(3)
  6e:   de e9                   fsubrp %st,%st(1)
  70:   d9 c9                   fxch   %st(1)
  72:   de c2                   faddp  %st,%st(2)
  74:   dd 5d c0                fstpl  -0x40(%ebp)
  77:   dd 5d c8                fstpl  -0x38(%ebp)

That is 39 bytes of machine code. When we consider the same operation in C

 double complex a,b,c; 
 c=a*b; 

and take a look at the output (obtained in the same way as above), we get:

  41:   8d 45 b8                lea    -0x48(%ebp),%eax
  44:   dd 5c 24 1c             fstpl  0x1c(%esp)
  48:   dd 5c 24 14             fstpl  0x14(%esp)
  4c:   dd 5c 24 0c             fstpl  0xc(%esp)
  50:   dd 5c 24 04             fstpl  0x4(%esp)
  54:   89 04 24                mov    %eax,(%esp)
  57:   e8 fc ff ff ff          call   58 <main+0x58>
  5c:   83 ec 04                sub    $0x4,%esp
  5f:   dd 45 b8                fldl   -0x48(%ebp)
  62:   dd 5d c8                fstpl  -0x38(%ebp)
  65:   dd 45 c0                fldl   -0x40(%ebp)
  68:   dd 5d d0                fstpl  -0x30(%ebp)

This is also 39 bytes of machine code, but the function called at offset 57 does the real part of the work and performs the desired operation. So we have 27 bytes of machine code just to set up and invoke the mul operation. The function behind it is __muldc3, provided by libgcc_s.so, and it has a footprint of 1375 bytes of machine code. This slows down the code dramatically and gives an interesting output when using a profiler.

When we implement the BLAS examples above for zaxpy and perform the same test, the Fortran compiler should give better results than the C compiler.

(I used GCC 4.4.3 for this experiment, but I have noticed this behavior in other GCC releases too.)

So, in my opinion, when we think about which compiler is better, we should not only think about parallelization and vectorization, but also look at how basic operations are translated to assembler code. If that translation gives bad code, the optimizer can only work with those building blocks as input.

M.K. aka Grisu
  • 2
    I just cooked-up an example along the lines of your code, complex.c and added it to the code online. I had to add all the input/output to make sure nothing is optimized out. I only get a call to __muldc3 if I don't use -ffast-math. With -O2 -ffast-math I get 9 lines of inlined assembler. Can you confirm this? – Pedro Feb 06 '12 at 18:21
  • I've found a more specific cause for the difference in the generated assembler and have added this to my question above. – Pedro Feb 06 '12 at 19:01
  • Using -O2 lead the compiler to compute every what is possible at runtime, that's why such constructs are sometimes lost. The -ffast-math option should not be used in scientific computing when you want to rely on the outputs. – M.K. aka Grisu Feb 07 '12 at 08:25
  • 2
    Well, by that argument (no -ffast-math) you shouldn't use Fortran for your complex-valued computations. As I describe in the update to my question, -ffast-math or, more generally, -fcx-limited-range, forces gcc to use the same non-IEEE, restricted range computations as are standard in Fortran. So if you want the full range of complex values and correct Infs and NaNs, you shouldn't use Fortran... – Pedro Feb 07 '12 at 09:04
  • 2
    @Pedro: If you want GCC to behave like GFortran wrt. complex multiplication and division, you should use the -fcx-fortran-rules. – janneb Mar 02 '12 at 13:50
5

Folks,

I found this discussion very interesting, but I was surprised to see that re-ordering the loops in the matmul example changed the picture. I don't have an Intel compiler available on my current machine, so I am using gfortran, but a rewrite of the loops in mm_test.f90 to

call cpu_time(start)
do r=1,runs
  mat_c=0.0d0
  do j=1,n
    do k=1,n
      do i=1,n
        mat_c(i,j)=mat_c(i,j)+mat_a(i,k)*mat_b(k,j)
      end do
    end do
  end do
end do
call cpu_time(finish)

changed the entire results for my machine.

The previous version timing results were:

#time ./mm_test_f 10000 100
 matmul    time   6.3620000000000001     
 triple do time   21.420999999999999     

whereas re-arranging the triple loops as above yielded:

#time ./mm_test_f 10000 100
 matmul    time   6.3929999999999998     
 triple do time   3.9190000000000005    

This is gcc/gfortran 4.7.2 20121109 on an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz.

Compiler flags used were those from the Makefile I got here...

Schatzi
  • 6
    That's not surprising, since the matrix storage in memory favors one order, i.e., if rows are stored contiguously, it's better to loop over rows innermost, since then you can load each row once into fast local memory compared to repeatedly loading (a slice of) it to access a single element. See http://stackoverflow.com/questions/7395556. – Christian Clason Jul 16 '13 at 09:27
  • I guess I was surprised that the "intrinsic matmul" wouldn't be coded to do things this way. It's substantially faster with the triple do ordered in the second way. It does seem to be in this compiler set, as earlier gfortran versions I can get to were more "flat" in their timing - it didn't matter which way you did the mult - it took nearly the same time. – Schatzi Jul 24 '13 at 03:42