IMHO, the only feasible use cases of CompilationTarget -> "WVM" I can imagine are:
- You work on a device without a C compiler (e.g., when you just don't have the rights to install one).
- You want/have to leave the list Compile`CompilerFunctions[] of compilable functions for some reason. Then calling CompiledFunctionTools`CompilePrint on the CompiledFunction might reveal occurrences of MainEvaluate in the pseudocode (see the sketch after this list). A couple of times I have run into situations where the communication between the Wolfram Virtual Machine and the Mathematica Kernel was faster than between the library compiled from C and the Mathematica Kernel. But in most of these cases, when speed was important, I used LibraryLink to write a fully fledged C++ program with the same capabilities, and the runtimes were always considerably faster than with "WVM".
- Sometimes, "WVM" is also faster at calling precompiled functions like Dot many times on small matrices or vectors. But again, using LibraryLink to either call the appropriate BLAS routines or to simply write out the loops with trip counts known at compile time will speed things up far more than Compile with CompilationTarget -> "WVM".
Some people might argue that the compilation times with CompilationTarget -> "WVM" are shorter. My point of view on this: if you cannot amortize the extra compilation time of CompilationTarget -> "C" over CompilationTarget -> "WVM", then you simply have a task that was not worth compiling in the first place. But of course, there may be exceptions to this.
Generally, CompilationTarget -> "WVM" seems to produce slower code. Using CompilationTarget -> "C" forces a mighty C compiler to look at your code and optimize it (or at least improve it) for your hardware.
So, in summary, I recommend considering CompilationTarget -> "WVM" merely as a fallback in the absence of a working C compiler. Personally, I almost never use it.
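Whether such a fallback is even needed on a given machine can be checked with the CCompilerDriver` package (a quick check, not part of the argument above):
Needs["CCompilerDriver`"];
CCompilers[]
(* an empty list means CompilationTarget -> "C" has no compiler to work with *)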
Edit: One of the few(?) exceptions
Okay, here is the one killer application that I have found for "WVM": threading operations like Dot over tensors. The key points are the option RuntimeAttributes -> {Listable} and the fact that we call the function Dot, which has a very efficient, compiled backend. Moreover, with Parallelization -> True it also allows for some parallelization (which, unfortunately, does not scale particularly well).
Here is the code for several threaded versions of Dot:
(* recompiles on every call; targets the Wolfram Virtual Machine *)
ClearAll[TensorDotThreadWVM];
TensorDotThreadWVM[rankA_Integer, rankB_Integer] :=
  Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
   A . B,
   CompilationTarget -> "WVM",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

(* stores the CompiledFunction, so each rank combination is compiled only once *)
ClearAll[TensorDotThreadWVMMemoized];
TensorDotThreadWVMMemoized[rankA_Integer, rankB_Integer] :=
  TensorDotThreadWVMMemoized[rankA, rankB] =
   Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
    A . B,
    CompilationTarget -> "WVM",
    RuntimeAttributes -> {Listable},
    Parallelization -> True
    ];

(* recompiles on every call; compiles through a C compiler into a library *)
ClearAll[TensorDotThreadC];
TensorDotThreadC[rankA_Integer, rankB_Integer] :=
  Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
   A . B,
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

(* memoized variant of the C-compiled version *)
ClearAll[TensorDotThreadCMemoized];
TensorDotThreadCMemoized[rankA_Integer, rankB_Integer] :=
  TensorDotThreadCMemoized[rankA, rankB] =
   Compile[{{A, _Real, rankA}, {B, _Real, rankB}},
    A . B,
    CompilationTarget -> "C",
    RuntimeAttributes -> {Listable},
    Parallelization -> True
    ];
For example, suppose we are given many, many matrices (rank $= 2$) of size {m1, m2}, stored in some high-rank packed array A, and many, many matrices of size {m2, m3}, stored in some high-rank packed array B, like so:
n = 1000;
m1 = 12;
m2 = 8;
m3 = 11;
A = RandomReal[{-1, 1}, {n, n, m1, m2}];
B = RandomReal[{-1, 1}, {n, n, m2, m3}];
Now, we want to multiply each matrix in A with each corresponding matrix in B like so:
result = MapThread[Dot, {A, B}, 2]; // AbsoluteTiming // First
1.28851
Here are four alternatives that do the same:
resultWVM = TensorDotThreadWVM[2, 2][A, B]; // AbsoluteTiming // First
resultWVMMemoized = TensorDotThreadWVMMemoized[2, 2][A, B]; // AbsoluteTiming // First
resultC = TensorDotThreadC[2, 2][A, B]; // AbsoluteTiming // First
resultCMemoized = TensorDotThreadCMemoized[2, 2][A, B]; // AbsoluteTiming // First
0.154104
0.151299
0.600961
0.536399
Obviously, the two versions that use CompilationTarget -> "C" are slower because of the compilation cost.
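One can confirm that the difference is indeed the compilation step by timing just the construction of the CompiledFunctions (a quick check, not part of the original measurements):
AbsoluteTiming[TensorDotThreadWVM[2, 2];] // First
AbsoluteTiming[TensorDotThreadC[2, 2];] // First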
I am pretty sure that each of these methods calls the same BLAS routine dgemm (or cblas_dgemm); otherwise, it would be unlikely for the results to coincide bitwise:
Max[Abs[result - resultWVM]]
Max[Abs[result - resultWVMMemoized]]
Max[Abs[result - resultC]]
Max[Abs[result - resultCMemoized]]
TensorDotThreadCMemoized is memoized, i.e., the created CompiledFunction is stored in a DownValue and reused the next time it is called with the same ranks, so the compilation cost is paid only once.
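One can check that the first call above indeed left a stored definition behind (a minimal check, assuming the timings above have already been evaluated):
DownValues[TensorDotThreadCMemoized] // Length
(* greater than 1: the general definition plus one stored CompiledFunction per rank combination used *)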
So let's run this again (and let's measure time with RepeatedTiming):
result = MapThread[Dot, {A, B}, 2]; // RepeatedTiming // First
resultWVM = TensorDotThreadWVM[2, 2][A, B]; // RepeatedTiming // First
resultWVMMemoized = TensorDotThreadWVMMemoized[2, 2][A, B]; // RepeatedTiming // First
resultC = TensorDotThreadC[2, 2][A, B]; // RepeatedTiming // First
resultCMemoized = TensorDotThreadCMemoized[2, 2][A, B]; // RepeatedTiming // First
1.22665
0.172471
0.171726
0.326404
0.181896
Now we finally see that TensorDotThreadWVM has so little overhead that (i) it is not even worth memoizing and (ii) compiling into a library does not help (and might even be slightly counterproductive).
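The same approach works for other rank combinations; for instance, a hypothetical batch of matrix-vector products (the array x below is new, not part of the measurements above):
x = RandomReal[{-1, 1}, {n, n, m2}];
resultMatVec = TensorDotThreadWVMMemoized[2, 1][A, x]; // AbsoluteTiming // First
(* threads the rank-2 by rank-1 Dot over the two outer levels, giving an array of dimensions {n, n, m1} *)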
Auto-compiled functions (e.g., in Map or Table) use the WVM for this reason. Compilation times can be very noticeably long on Windows. But something as advanced as auto-compilation would be a very small fraction of use cases for users (as opposed to package developers). – Szabolcs Jan 03 '24 at 12:51