25

I've always had this question in mind (even if it may sound vague): in my numerical analysis courses we've always learned how to analyze and optimize code. However, since most linear algebra libraries (e.g. LAPACK, BLAS) have been continuously optimized by professional developers since the 1980s, I wonder what the purpose is of having us write algorithms down and run them. For example, we had to write the QR iteration algorithm in MATLAB when there's already a built-in function (eig). Are there special cases where these built-in functions are vulnerable to numerical instability, so that custom algorithms would outperform them?

Update: I want to thank you all for your interesting answers and comments; I have read them all.

I also realized the beauty of this very narrow field of mathematics, which has profound applications in our work. The effort that these mathematicians and computer scientists have invested in making these codes more powerful lets us (the ones learning about their theories) admire their work while studying, and know when to use these optimized built-in functions. While we write down 15 lines of code to compute the Householder QR decomposition of some matrix $A$, I can only wonder how many lines of code the built-in function qr() has (probably in the hundreds) so that it can serve us best in real-life applications.
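For concreteness, the classroom-scale Householder QR looks roughly like this (a Python sketch of the textbook algorithm, with none of the scaling, blocking, or pivoting options a library qr() adds):

```python
import numpy as np

def householder_qr(A):
    """Textbook Householder QR, roughly the 15-line classroom version.
    A sketch only -- nothing like the robustness of LAPACK's DGEQRF."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # sign chosen to avoid cancellation
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue  # column already zero, nothing to eliminate
        v /= nv
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # apply H = I - 2 v v^T on the left
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)    # accumulate Q = H_1 H_2 ... H_k
    return Q, R
```

Running this against a library QR on the same matrix is an instructive exercise in itself.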

CynthiaZ1998
    It may be useful to rewrite LAPACK and BLAS functions if your library targets a specific architecture, e.g., a custom-made vector processor, to either take advantage of the parallelism offered, or to overcome inherent limitations (e.g., no FPU). Alternatively, when your inputs are constrained, you may want to stray away from "general purpose" to implement niche low-complexity algorithms which only work for those inputs. On a general purpose PC, LAPACK and BLAS (here, I include library versions from Intel, etc.) may be difficult to beat. – Aravindh Krishnamoorthy Aug 20 '21 at 14:41
  • 1
    It's important that we understand how and why these algorithms work. Because there are nuances, exceptions, & corner-cases we need to understand. The best way to learn is by doing it yourself. There are few shortcuts to real knowledge. And you can always read the source of LAPACK's QR; it's not as complicated as you might think. – alexchandel Aug 20 '21 at 23:00

8 Answers

33

Although LAPACK has some incredibly optimized code, it can still be worth it to write your own version in a few cases.

  1. The most important reason (and the reason they make you do it in your course): it's a great learning experience. You will never fully understand something like the QR algorithm (which is actually incredibly complicated to get the details right) unless you spend some time implementing it. A lot of the skills you pick up will translate when you write other numerical codes.

  2. A lot of the code is contributed by the researchers who develop the algorithms and it takes a lot of effort. Researchers are usually evaluated on how many papers they publish, not the code they write for libraries, so there can be a lack of incentive for researchers to contribute these codes. As an example, it was only last year that an optimized version of the QZ algorithm was contributed, although a paper was published on it in 2006.

  3. LAPACK code needs to be robust and general. This is mostly important for the small scale routines. I'll talk about DLARTG as an example. That routine calculates a Givens rotation. It has a lot of scaling, safety checks and exceptions to make sure it is always accurate for all the input you could ever give it. If you know that for your application, these safety checks will not be necessary or you don't need the strong accuracy guarantees, you can achieve faster results.
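To make point 3 concrete, here is the kind of difference involved (a Python sketch; the scaled version only gestures at what DLARTG actually does, which includes further safeguards):

```python
import math

def givens_naive(a, b):
    """Textbook Givens rotation: a*a + b*b overflows long before the
    true result r = hypot(a, b) is anywhere near the overflow threshold."""
    r = math.sqrt(a * a + b * b)
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r

def givens_scaled(a, b):
    """Simplified sketch of the scaling DLARTG performs; not the real routine."""
    if b == 0.0:
        return 1.0, 0.0, a
    if a == 0.0:
        return 0.0, 1.0, b
    s = max(abs(a), abs(b))                             # scale into a safe range
    r = s * math.sqrt((a / s) ** 2 + (b / s) ** 2)      # no intermediate overflow
    return a / r, b / r, r
```

For inputs around 1e200 the naive version silently produces infinities and zeros, while the scaled one returns the correct rotation; if you know your inputs are always moderate, you can skip the scaling and go faster.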

Thijs Steel
    I think point 3 can be generalized to something like "special cases may be solved quicker than by using the general strategy, so it may be worthwhile to implement a special solution". One simple example is limited array sizes: If you know your data never exceeds the stack size (or if you can even tailor-fit data chunks into the cache of the CPU you know you have) you can exploit data locality and caches in a way a general library cannot. – Peter - Reinstate Monica Aug 19 '21 at 09:39
  • 7
    If you choose to implement special cases yourself in real-world code, make sure you know enough numerical analysis to be sure you really have a special case. Otherwise, you may just get low quality answers without realizing it. As a trivial example, if you can't think of several reasons why naively implementing the high-school formula for solving a quadratic is the wrong way to compute the roots, you don't understand what is going on! – alephzero Aug 19 '21 at 13:21
  • @alephzero -- I can only think of two special cases for quadratic equations. i) The equation has complex or imaginary roots ($b^2 < 4ac$), which isn't really that special and should be understood by everyone involved. ii) As $a$ in $ax^2 + bx + c = 0$ approaches 0, it seems that this equation does not approach the linear equation $bx + c = 0$ as one might naively think, so studying that behavior would be interesting. See also: https://math.stackexchange.com/questions/1399140/the-behavior-of-quadratic-formula-in-the-limit-a-to-0/1399144 – JosephDoggie Aug 19 '21 at 14:28
  • See also: https://math.stackexchange.com/questions/395004/quadratic-equation-with-0-coefficients – JosephDoggie Aug 19 '21 at 14:34
  • 3
    @JosephDoggie special cases for quadratic equations include the situations where $b^2 \gg 4ac$ and the calculations are numerically unstable – svavil Aug 19 '21 at 16:46
  • 1
    Another thing to add to 3 could be, in the case of small known-sized inputs, it could be possible to do things like loop unrolling. Doing that manually would be a bit extreme nowadays, but you might need to do a little coaxing to get the compiler to realize it can unroll.

    I also vaguely wonder if there's room for a thought about sparse matrices and/or RCIs, but the spirit of the question seems more in the realm of low-level optimizations...

    – Zwuwdz Aug 19 '21 at 17:27
  • @svavil -- Note that as a approaches 0, in general, b^2 will be much greater than 4ac unless b is also approaching 0 or something – JosephDoggie Aug 19 '21 at 20:54
  • 3
    @JosephDoggie Consider an equation like $x^2 - 100000000.000000001x + 1 = 0$ where the roots are $10^8$ and $10^{-8}$. Or an equation like $10^{300}x^2 + 3\times10^{300}x + 2\times10^{300} = 0$ where calculating $b^2-4ac$ will overflow (in 64-bit IEEE floating point) even though the roots are easy to calculate by hand. Now think of some more cases for yourself :) – alephzero Aug 19 '21 at 23:05
  • 4
    @JosephDoggie In fact the $\pm$ in $b \pm\sqrt{b^2-4ac}$ always causes unnecessary loss of precision. A better way is to calculate the root with largest modulus using the formula, and use the fact that the product of the roots is $c/a$. – alephzero Aug 19 '21 at 23:13
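A sketch of the stable evaluation described in the comments above (Python; it assumes real roots, nonzero $c$, and no overflow in $b^2-4ac$, so it is an illustration rather than a robust solver):

```python
import math

def quadratic_roots(a, b, c):
    """Roots of a x^2 + b x + c = 0, avoiding the cancellation in
    -b + sqrt(b^2 - 4ac) when b^2 >> 4ac: compute the large-magnitude
    root first, then use the identity root1 * root2 = c / a."""
    disc = math.sqrt(b * b - 4.0 * a * c)        # assumes real roots, no overflow
    q = -0.5 * (b + math.copysign(disc, b))      # the numerically safe numerator
    return q / a, c / q                          # large root, then small root
```

The naive formula loses most digits of the small root of $x^2 + 10^8 x + 1 = 0$; this version recovers it to full precision.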
8

@ThijsSteel covers a lot; here is another perspective I find important:

Even if you have available excellent implementations of any algorithm you might need, you still need to understand some of the internals in order to make good use of your library. Some examples:

  • Any linear-algebra library can compute an inverse matrix really fast and as accurately as mathematics allows. But you need to know a little theory to realize that you should still avoid forming the inverse if at all possible. (Computing a matrix decomposition and solving with it, without explicitly forming the inverse, is generally better.)
  • Your library can probably compute eigenvalues of an arbitrary matrix, so it might be tempting to just use the general-purpose function and be done with it. But with some background knowledge you can appreciate how much(!) better the special-purpose algorithms for symmetric/Hermitian matrices work. So it is reasonable to make a considerable effort to transform your problem into that case.
  • To achieve optimal performance you have to think about data layout. Row-major? Column-major? AoS or SoA (in the case of complex numbers)? A good library will be able to work with any of these, but you need to know a little about the inner workings to make an informed choice (though in practice: try it out and measure performance).
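In NumPy terms, the first two bullets amount to something like this (an illustrative sketch; the matrices here are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Solve A x = b via a factorization instead of forming A^{-1} explicitly.
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
x = np.linalg.solve(A, b)           # LU-based solve: cheaper and more accurate
x_inv = np.linalg.inv(A) @ b        # explicit inverse: avoid when possible

# 2) Tell the library about structure: eigh/eigvalsh are for symmetric input.
S = A + A.T
w_sym = np.linalg.eigvalsh(S)                   # specialized routine, real output
w_gen = np.sort(np.linalg.eigvals(S).real)      # general-purpose on the same matrix
```

Both pairs agree here, but the structured calls are faster and carry stronger accuracy guarantees as the problems get harder.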

The situation is somewhat similar to writing assembly code. You should pretty much never do so by hand. But knowing (some) assembly will make you a better programmer in any high-level language.

Simon
    “inverse matrix...Matrix decomposition...arbitrary matrix...symmetric/hermitian matrices...Row-Major” – I'd add that often, best performance is achieved by avoiding any matrices at all from turning up. For algorithms like conjugate gradients and Lanczos you only need a linear operation, which may well be implemented more efficiently as a function than a matrix. – leftaroundabout Aug 20 '21 at 14:07
7

To clarify @ThijsSteel's good answer:

The point is not to attempt to surpass the optimization of code written by very experienced people who have wrangled with it for decades.

The point is to acquaint yourself with the issues that were addressed in that code, to appreciate the "niceties", and not be naive about related matters in the future.

(It's wise/fair to learn from other people. You don't have to reinvent bad wheels yourself. "They sell them!" Both bad wheels and good... :)

paul garrett
7

I used to try to optimize code by writing in assembly language (as opposed to C). I had some clear success: a real-time microphone array worked when written in assembly language, but it would not work at all when written in C.

However, more than 20 years have passed since then, and it's very difficult to achieve this sort of improvement nowadays. Compilers have become better, and the same is true for libraries. People still use packages based on LINPACK (e.g. linear algebra for eigenvalue/eigenvector routines), the core of which was written in Fortran in the 1970s! There have been new wrappers for C#, etc.

Any time one writes in a lower-level language, there are increased risks of errors and unreadable code, and of course a problem for anyone who has to maintain the code besides the original author.

Also, there may be special cases, edge cases, etc. in the programming problem to be solved. However, these would probably be best handled in the 'calling program', e.g.:

if (such and such)
    call the library
else
    implement special rule for edge case

For example, one could avoid divide by zero if the absolute value of a floating point number is less than some threshold, and then call a calculation involving a division.
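A minimal sketch of such a guard in the calling program (the threshold and fallback here are hypothetical placeholders; in real code both come from the problem's scaling):

```python
EPS = 1e-12  # hypothetical threshold; choose it from your problem's scaling

def safe_ratio(num, den, fallback=0.0):
    """Handle the near-zero edge case ourselves; only call the general
    division-based routine when it is safe to do so."""
    if abs(den) < EPS:
        return fallback      # special rule for the edge case
    return num / den         # the "library call"
```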

JosephDoggie
6

This might be a tangent, but I feel it's worth adding. A common way for user code to beat libraries is to exploit structure the libraries are not aware of. Block-diagonal matrices are easier to work with in structured form than as a dense matrix that stores the complete matrix and lots of zeros. Cyclic boundary conditions in PDEs can be solved more easily than by working on the dense system matrix. In addition, the rise of new hardware datatypes such as Float16/BFloat16 means old libraries need to catch up to support them. Similarly, if you have some composed numerical type like complex numbers or dual numbers, writing and optimizing your own functionality will beat emulating the math with real-number matrices, due to data locality.
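As a sketch of the block-diagonal case (Python/NumPy; the block sizes and counts here are arbitrary):

```python
import numpy as np

# Solving a block-diagonal system block by block beats handing the assembled
# dense matrix to the general solver: k small solves instead of one big one,
# O(k * m^3) work instead of O((k*m)^3).
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 4)) + 4.0 * np.eye(4) for _ in range(8)]
b = rng.standard_normal(32)

# Structured: one independent 4x4 solve per block.
x_struct = np.concatenate(
    [np.linalg.solve(B, b[4 * i:4 * i + 4]) for i, B in enumerate(blocks)]
)

# Unstructured: assemble the 32x32 matrix (mostly zeros) and solve densely.
A = np.zeros((32, 32))
for i, B in enumerate(blocks):
    A[4 * i:4 * i + 4, 4 * i:4 * i + 4] = B
x_dense = np.linalg.solve(A, b)
```

The two answers agree; only the amount of work differs, and the gap grows quickly with the number of blocks.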

6

You mention LAPACK and BLAS. One really sobering thing to realise is that they were written in FORTRAN, and that means code that looks straightforward (and is thus easy to write by people specialising in mathematics rather than programming) can be heavily optimised by the compiler.

The same does not work in C or C++. For starters, there is no standardised way to pass multidimensional matrices to subroutines (some recent C standard got variable-dimensioned arrays in, just to have the next standard declare support optional again). For another, arrays passed to FORTRAN subroutines are guaranteed to point to non-overlapping storage. That allows significant optimisations and reordering of load/store instructions. In connection with standard representation and passing conventions, strength reduction and pointer arithmetic are expected to be done by the compiler as a rule.

C99 added a "restrict" keyword which can be employed to similar effect, but it conflicts with variable-dimension matrix syntax and leads to inscrutable code only writable and readable by experts.

C++ can get around some of those problems with wagonloads of template programming, but every linear algebra package implements its own matrix representations which are not compatible with one another.

As a result, creating good interoperable linear algebra libraries in C/C++ is still playing catch-up with FORTRAN code half a century old, and many of the better C/C++ implementations of those libraries essentially call external FORTRAN routines that have been compiled using a good FORTRAN compiler.

To a good degree, that is the problem of C conflating arrays and pointers, particularly in function call arguments, and exposing the internals of array implementations in that manner, making it hard for the compiler to assume anything useful about storage layout and making higher-dimensional arrays with sizes not fixed at compilation awkward to deal with.

So when trying to outperform "standard routines", you may not just be fighting against a lot of accumulated knowledge; you may also be competing with code that was never forced through a compiler with its hands tied behind its back by language semantics unsuitable for numeric work.

  • 6
    It's been a trope for 50 years that C (and later C++) can not be optimized as well by compilers as Fortran. And the trope has been wrong for 25 of these years. – Wolfgang Bangerth Aug 20 '21 at 20:23
  • 1
    @WolfgangBangerth: C has no way of giving a compiler all the information that would be needed in many cases to let it produce the best possible machine code that would meet application requirements. A qualifier like restrict would be good if the concept of "based upon" weren't defined in such a clumsy fashion with corner cases that are absurd, ambiguous, and unworkable, but beyond that a good language should include constructs that would allow a compiler to skip any or all remaining executions of a loop, at its leisure, execute a function as many or as few times as convenient, etc. – supercat Aug 20 '21 at 22:47
  • 1
    @WolfgangBangerth: If a function discovers, while performing some task that fills an array with computed values, that the task can't be completed in useful fashion and should be aborted as soon as convenient, and if client code won't care how much or little of the array was written in that case, it should be possible to tell the compiler that no further iterations are required but that performing later iterations before determining that they'll be necessary would be harmless. I don't know if Fortran is better than C, but C is really not very good from an optimization standpoint. – supercat Aug 20 '21 at 22:52
  • 2
    @supercat I don't care about arguing academic points about how well designed a certain programming language is or not. Empirically, C/C++ do not lead to slower code, and have not for a very long time. All modern versions of BLAS and LAPACK are written in C/C++ for this reason. (Examples: BLIS, Eigen.) – Wolfgang Bangerth Aug 20 '21 at 23:27
  • 4
    @supercat But to be pragmatic about the issue, let me also explain why the language doesn't matter. For the past nearly 20 years, dealing with dense linear algebra (and nearly everything else), it hasn't been any more about small optimizations and being able to eliminate an instruction here or there (think about the difference between memcpy and memmove), but it's exclusively been about data movement and memory accesses. The only difference between a good and a bad BLAS implementation these days is whether it blocks memory accesses. Compiler optimizations are 100% unimportant. – Wolfgang Bangerth Aug 21 '21 at 02:04
  • 1
    @WolfgangBangerth: There are many application where behavior that is observably inconsistent with sequential program execution would sometimes be tolerable, provided there are limits to such inconsistency. The C Standard, however, is designed around the principle that the only way to allow deviations from sequential execution is to classify as Undefined Behavior any chain of events that might make such deviation observable. – supercat Aug 21 '21 at 18:11
  • 1
    @WolfgangBangerth: As a simple example, a good programming language should include a notation to indicate that a function may be used as though it's idempotent, without having to promise that it is completely free of side effects. If such a function happens to write to a log when it's called, and it gets called before the start of a loop which happens to exit before the actual call to the function, that would result in a log entry being written despite the fact that execution never actually reached the function call, but that shouldn't invoke "jump the rails" Undefined Behavior. If... – supercat Aug 21 '21 at 18:16
  • 1
    ...the programmer needs to know whether execution actually reached a call to the function, the programmer would need to refrain from annotating the function to say such timing doesn't matter, but if the programmer doesn't mind the fact that the function would write the log entry without "actually" being called, a the programmer should be able to invite such hoisting. – supercat Aug 21 '21 at 18:18
  • 1
    @supercat These are now questions totally unrelated to the Fortran vs C discussion. – Wolfgang Bangerth Aug 22 '21 at 19:28
  • 1
    @WolfgangBangerth: I was under the impression that Fortran had some "parallel do" constructs with somewhat loose semantics, as well as ways of designating that compilers may treat functions as "pure", though I don't know what the exact semantics of such constructs are. If Fortran handles them better than C, they would give an edge to Fortran. If not, well, it should. – supercat Aug 22 '21 at 19:32
  • 1
    @supercat You can do parallel for in both languages via OpenMP. But the point I made above stays true: You don't gain performance in dense linear algebra these days if you do not explicitly model data movement from main memory into the cache and from the cache onto the processor. 98% of performance improvements in BLAS over the past 20 years have come from explicitly blocking data access based on data movement models. Which language you do this in is unimportant, as are the language constructs the language offers. – Wolfgang Bangerth Aug 22 '21 at 22:01
4

The point of learning how to write "the QR iteration algorithm" is not because eig will be slower. It is because "the QR iteration algorithm" is easy.

When you learn things, you are given easy tasks.

We don't teach children how to add 1+1 because we need the answer. We don't teach them how to manually carry on paper when they add 7+17 because that is the optimal way to do it. We don't teach long division of 3 into 36 for the same reason.

We teach those because it is easy, and the skills you learn are useful for non-easy problems.

Not every problem you'll want to solve in MATLAB will have a pre-written solution. For some problems with pre-written solutions, finding them will be harder than (if you have some knowledge) just writing the solution yourself, and the version you write yourself will be good enough.

Optimization techniques often cross from one algorithm to another. So when you learn how to write an algorithm, and how to modify it to make it faster, and even learn the insane things the "professional" versions do, you gain skills that may apply the next time you need an algorithm that isn't pre-written for you.

Beyond that, it is possible that this isn't the end. You may be the next one who implements or optimizes those algorithms. You might invent a new mathematical algorithm, and have to write out a working sketch of it to see how it acts. You might leave scientific computing, and become a computer programmer (I hear they are hiring), and writing algorithms and optimizing them is a useful skill for that job.
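For reference, the classroom exercise from the question, unshifted QR iteration, really is just a few lines (a Python sketch; production eig-style codes add a Hessenberg reduction, shifts, and deflation, and converge far faster):

```python
import numpy as np

def qr_iteration_eigvals(A, iters=500):
    """Bare-bones unshifted QR iteration. Converges only slowly, and only
    for matrices whose eigenvalues have distinct absolute values (e.g. the
    symmetric example below); real implementations handle the rest."""
    Ak = np.array(A, dtype=float)
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                 # A_{k+1} = R Q is similar to A_k
    return np.sort(np.diag(Ak))    # diagonal converges to the eigenvalues
```

Comparing this sketch to a library eigenvalue routine is exactly the kind of exercise that builds the skills described above.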

3

In some special cases it may be worth it: namely, if you have an application that's very different from what the libraries are intended for.

I have written an optimization application that needs to do 80 million very simple linear programming solutions per second on 8 cores. That's 10 million per second per core. I tested a number of available linear programming libraries. Every single one of them was too slow by a factor of over 10x.

In the end I had to write a system that picks some of the inequalities, treats them as equations, solves them (I solved them algebraically, to do as much work as possible at program-writing time and leave as little as possible to execution time), calculates the value of the objective function, and then does the same for a different set of picked inequalities; the set yielding the most optimal value wins.

If I had no knowledge about the nature of linear programming problems, namely that the optimal point has to lie on the boundary of the inequalities, this custom solver would have been impossible to write.

It seems that the available libraries were intended for cases where the linear programming problem is complex as opposed to being simple, and the sheer amount of CPU time used to solve the problem is needed to solve one complex problem rather than solving 10 million simple problems with slightly different parameters.
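The "treat a subset of inequalities as equations" idea can be sketched by brute force for a toy two-variable problem (a hypothetical illustration, nothing like the hand-derived algebraic solver described above):

```python
import itertools
import numpy as np

def tiny_lp_max(c, A, b):
    """Maximize c.x subject to A x <= b (2 variables) by intersecting every
    pair of constraint boundaries and keeping the feasible vertex with the
    best objective -- valid because the optimum of a feasible, bounded LP
    lies on the boundary, at a vertex."""
    c, A, b = np.asarray(c, float), np.asarray(A, float), np.asarray(b, float)
    best_val, best_x = -np.inf, None
    for i, j in itertools.combinations(range(len(b)), 2):
        M = A[[i, j]]
        if abs(np.linalg.det(M)) < 1e-12:
            continue                       # parallel boundaries: no vertex
        x = np.linalg.solve(M, b[[i, j]])  # the two inequalities as equalities
        if np.all(A @ x <= b + 1e-9):      # keep only feasible vertices
            val = c @ x
            if val > best_val:
                best_val, best_x = val, x
    return best_val, best_x
```

For millions of tiny, similar problems, precomputing the algebra of these small solves (as the answer describes) is what makes the custom approach so much faster than a general-purpose library.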

juhist