IMHO, the only feasible use cases of CompilationTarget -> "WVM" I can imagine are:
- You work on a device without a C compiler (e.g., when you just don't have the rights to install one).
- You want/have to leave the list Compile`CompilerFunctions[] of compilable functions for some reason. Then calling CompiledFunctionTools`CompilePrint on the CompiledFunction might reveal occurrences of MainEvaluate in the pseudocode (see the sketch after this list). A couple of times I have run into situations where the communication between the Wolfram Virtual Machine and the Mathematica Kernel was faster than between the library compiled from C and the Mathematica Kernel. But in most of these cases, when speed was important, I used LibraryLink to write a fully fledged C++ program with the same capabilities, and the runtimes were always considerably faster than with "WVM".
- Sometimes, "WVM" is also faster at calling precompiled functions like Dot many times on small matrices or vectors. But again, using LibraryLink to either call the appropriate BLAS routines or to simply write out the loops with trip counts known at compile time will speed things up far more than Compile with CompilationTarget -> "WVM".
Some people might argue that the compilation times with CompilationTarget -> "WVM" are shorter. My point of view on this: if you cannot amortize the extra compilation time of CompilationTarget -> "C" over CompilationTarget -> "WVM", then you simply have a task that was not worth compiling in the first place. But of course, there may be exceptions to this.
Generally, CompilationTarget -> "WVM" seems to produce slower code. Using CompilationTarget -> "C" forces a mighty C compiler to look at your code and optimize it (or at least improve it) for your hardware.
So, in summary, I recommend considering CompilationTarget -> "WVM" merely as a fallback in the absence of a working C compiler. Personally, I almost never use it.
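Whether such a fallback is even needed on a given machine can be checked with the CCompilerDriver` package (a quick check, not part of the argument above):
Needs["CCompilerDriver`"];
CCompilers[]
(* an empty list means CompilationTarget -> "C" has no compiler to work with *)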
Edit: One of the few(?) exceptions
Okay, here is the one killer application that I have found for "WVM": threading operations like Dot over tensors. The key points are the option RuntimeAttributes -> {Listable} and the fact that we call the function Dot, which has a very efficient, compiled backend. Moreover, with Parallelization -> True it also allows for some parallelization (which, unfortunately, does not scale particularly well).
Here is the code for several threaded versions of Dot:
(* recompiles on every call; targets the Wolfram Virtual Machine *)
ClearAll[TensorDotThreadWVM];
TensorDotThreadWVM[rankA_Integer, rankB_Integer] :=
  Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
   A . B,
   CompilationTarget -> "WVM",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

(* stores the CompiledFunction, so each rank combination is compiled only once *)
ClearAll[TensorDotThreadWVMMemoized];
TensorDotThreadWVMMemoized[rankA_Integer, rankB_Integer] :=
  TensorDotThreadWVMMemoized[rankA, rankB] =
   Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
    A . B,
    CompilationTarget -> "WVM",
    RuntimeAttributes -> {Listable},
    Parallelization -> True
    ];

(* recompiles on every call; compiles through a C compiler into a library *)
ClearAll[TensorDotThreadC];
TensorDotThreadC[rankA_Integer, rankB_Integer] :=
  Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
   A . B,
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

(* memoized variant of the C-compiled version *)
ClearAll[TensorDotThreadCMemoized];
TensorDotThreadCMemoized[rankA_Integer, rankB_Integer] :=
  TensorDotThreadCMemoized[rankA, rankB] =
   Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
    A . B,
    CompilationTarget -> "C",
    RuntimeAttributes -> {Listable},
    Parallelization -> True
    ];
For example, suppose we are given many, many matrices (rank $= 2$) of size {m1, m2}, stored in some high-rank packed array A, and many, many matrices of size {m2, m3}, stored in some high-rank packed array B, like so:
n = 1000;
m1 = 12;
m2 = 8;
m3 = 11;
A = RandomReal[{-1, 1}, {n, n, m1, m2}];
B = RandomReal[{-1, 1}, {n, n, m2, m3}];
Now, we want to multiply each matrix in A with each corresponding matrix in B like so:
result = MapThread[Dot, {A, B}, 2]; // AbsoluteTiming // First
1.28851
Here are four alternatives that do the same:
resultWVM = TensorDotThreadWVM[2, 2][A, B]; // AbsoluteTiming // First
resultWVMMemoized = TensorDotThreadWVMMemoized[2, 2][A, B]; // AbsoluteTiming // First
resultC = TensorDotThreadC[2, 2][A, B]; // AbsoluteTiming // First
resultCMemoized = TensorDotThreadCMemoized[2, 2][A, B]; // AbsoluteTiming // First
0.154104
0.151299
0.600961
0.536399
Obviously, the two versions that use CompilationTarget -> "C" are slower because of the compilation cost.
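One can confirm that the difference is indeed the compilation step by timing just the construction of the CompiledFunctions (a quick check, not part of the original measurements):
AbsoluteTiming[TensorDotThreadWVM[2, 2];] // First
AbsoluteTiming[TensorDotThreadC[2, 2];] // First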
I am pretty sure that each of these methods calls the same BLAS routine dgemm (or cblas_dgemm); otherwise, it would be unlikely for the results to coincide bitwise:
Max[Abs[result - resultWVM]]
Max[Abs[result - resultWVMMemoized]]
Max[Abs[result - resultC]]
Max[Abs[result - resultCMemoized]]
TensorDotThreadCMemoized is memoized, i.e., the created CompiledFunction is stored in a DownValue and reused the next time it is called with the same ranks, so the compilation cost is paid only once.
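One can check that the first call above indeed left a stored definition behind (a minimal check, assuming the timings above have already been evaluated):
DownValues[TensorDotThreadCMemoized] // Length
(* greater than 1: the general definition plus one stored CompiledFunction per rank combination used *)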
So let's run this again (and let's measure time with RepeatedTiming):
result = MapThread[Dot, {A, B}, 2]; // RepeatedTiming // First
resultWVM = TensorDotThreadWVM[2, 2][A, B]; // RepeatedTiming // First
resultWVMMemoized = TensorDotThreadWVMMemoized[2, 2][A, B]; // RepeatedTiming // First
resultC = TensorDotThreadC[2, 2][A, B]; // RepeatedTiming // First
resultCMemoized = TensorDotThreadCMemoized[2, 2][A, B]; // RepeatedTiming // First
1.22665
0.172471
0.171726
0.326404
0.181896
Now we finally see that TensorDotThreadWVM has so little overhead that (i) it is not even worth memoizing and (ii) compiling into a library does not help (and might even be slightly counterproductive).
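The same approach works for other rank combinations; for instance, a hypothetical batch of matrix-vector products (the array x below is new, not part of the measurements above):
x = RandomReal[{-1, 1}, {n, n, m2}];
resultMatVec = TensorDotThreadWVMMemoized[2, 1][A, x]; // AbsoluteTiming // First
(* threads the rank-2 by rank-1 Dot over the two outer levels, giving an array of dimensions {n, n, m1} *)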
Auto-compiled functions (e.g., in Map or Table) use the WVM for this reason. Compilation times can be very noticeably long on Windows. But something as advanced as auto-compilation would be a very small fraction of use cases for users (as opposed to package developers). – Szabolcs Jan 03 '24 at 12:51