While reading many research papers comparing parallel implementations of algorithms on different machines/architectures, I have noticed that the performance comparison is almost always reported in GFlop/s rather than the actual wall-clock time of the run in seconds. I am curious why this convention is used.
My only guess is that, since every vendor advertises its device as having a certain peak flop count per second, such papers report performance in GFlop/s to show how much of that "potential" the particular application at hand actually achieves.
Is this correct?
Also, suppose the performance of an $m \times n$ matrix times $n \times 1$ vector multiply has been stated as 4 GFlop/s. Is it reasonable to obtain the wall-clock time in seconds from the following formula?
$$\frac{m(2n-1)}{4 \times 10^9} \hspace{3mm} \text{seconds},$$ where $m(2n-1)$ is the number of floating-point operations in the matrix-vector multiplication.
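For concreteness, here is a minimal sketch (in Python) of the estimate I have in mind; the values of `m`, `n`, and the 4 GFlop/s rate are just placeholders:

```python
# Estimate wall-clock time from a reported GFlop/s rate for a
# dense matrix-vector multiply y = A x, with A of size m x n.
# m, n, and gflops below are arbitrary placeholder values.

m, n = 10_000, 10_000      # matrix dimensions (placeholders)
gflops = 4.0               # reported performance in GFlop/s

flops = m * (2 * n - 1)    # n multiplies and (n - 1) adds per row, m rows
seconds = flops / (gflops * 1e9)

print(f"{flops:.3e} flops at {gflops} GFlop/s -> ~{seconds:.4f} s")
```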