Fast generation of random variates as test case
The earlier question (195435) about performance tuning for the fast generation of gamma-distributed random variates has stirred my curiosity. Below I use slightly simpler code from the 3rd edition of Numerical Recipes (`Normaldev`) to test different implementations of a numerical routine that generates random variates for a normal distribution.
Different Implementations
The following implementations share the same core body but differ in how the function is declared in Mathematica:

- `normaldev` is an uncompiled implementation
- `normaldevC` uses `Compile` with `CompilationTarget -> "C"`
- `normaldevCLP` uses `Compile` as above, but adds listability and parallelization
- `normaldevLLVM` finally uses the new compiler and `FunctionCompile`
Code (note that inside the template strings the slot `body` has been escaped as `` \`body\` ``):
{ normaldev, normaldevC, normaldevCLP, normaldevLLVM } = With[
    {
        funcBody = "
            While[ cond,
                u = RandomReal[];
                v = 1.7156*(RandomReal[] - 0.5);
                x = u - 0.449871;
                y = Abs[v] + 0.386595;
                q = x*x + y*(0.19600*y - 0.25482*x);
                If[ q > 0.27597 && ( q > 0.27846 || v*v > -4.*Log[u]*u*u ), cond = True, cond = False ]
            ];
            mu + sig*v/u"
    },
    Module[
        { normaldev, normaldevC, normaldevCLP, normaldevLLVM },

        normaldev = StringTemplate["
            Function[ { mu, sig },
                Module[
                    {
                        u = 1.0,
                        v = 1.0,
                        x = 1.0,
                        y = 1.0,
                        q = 1.0,
                        cond = True
                    },
                    \`body\`
                ]
            ]"
        ];

        normaldevC = StringTemplate["
            Compile[ { { mu, _Real }, { sig, _Real } },
                Module[
                    {
                        u = 1.0,
                        v = 1.0,
                        x = 1.0,
                        y = 1.0,
                        q = 1.0,
                        cond = True
                    },
                    \`body\`
                ],
                CompilationTarget -> \"C\"
            ]"
        ];

        normaldevCLP = StringTemplate["
            Compile[ { { mu, _Real }, { sig, _Real } },
                Module[
                    {
                        u = 1.0,
                        v = 1.0,
                        x = 1.0,
                        y = 1.0,
                        q = 1.0,
                        cond = True
                    },
                    \`body\`
                ],
                CompilationTarget -> \"C\",
                RuntimeAttributes -> {Listable},
                Parallelization -> True
            ]"
        ];

        normaldevLLVM = StringTemplate["
            FunctionCompile[
                Function[
                    {
                        Typed[ mu, \"Real64\" ],
                        Typed[ sig, \"Real64\" ]
                    },
                    Module[
                        {
                            Typed[ u, \"Real64\" ] = 1.0,
                            Typed[ v, \"Real64\" ] = 1.0,
                            Typed[ x, \"Real64\" ] = 1.0,
                            Typed[ y, \"Real64\" ] = 1.0,
                            Typed[ q, \"Real64\" ] = 1.0,
                            Typed[ cond, \"Boolean\" ] = True
                        },
                        \`body\`
                    ]
                ],
                CompilerOptions -> {
                    \"AbortHandling\" -> False,
                    \"LLVMOptimization\" -> \"ClangOptimization\"[3]
                }
            ]"
        ];

        Map[
            ToExpression @ TemplateApply[ #, <| "body" -> funcBody |> ] &,
            { normaldev, normaldevC, normaldevCLP, normaldevLLVM }
        ]
    ]
];
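A quick sanity check that the template expansion produced four working functions (each call draws one pseudo-random variate, so the actual values will differ from run to run):

(* Apply each generated function to mu = 0., sig = 1.;
   each should return a single normal variate. *)
Through[ { normaldev, normaldevC, normaldevCLP, normaldevLLVM }[ 0., 1. ] ]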
Performance Evaluation
Using `RepeatedTiming`, we can compare the run times of these implementations, which I have sorted below from slowest to fastest.
RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldev[0.,1.], {1000000}] ]
(* {0.535271,{-0.00048065,1.00115}} *)
RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldevLLVM[0., 1.], {1000000}] ]
(* {0.199849,{0.00011759,1.00026}} *)
RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldevC[0.,1.], {1000000}] ]
(* {0.18677,{-0.00244043,0.998699}} *)
With[ { args = Transpose @ ConstantArray[{0.0,1.0},1000000]},
RepeatedTiming @ Through[ {Mean, StandardDeviation}@ ( normaldevCLP @@ args )]
]
(* {0.119437,{0.000838859,0.999507}} *)
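For context (not one of the implementations above), the obvious baseline is the built-in vectorized generator, which can be timed the same way; I omit the output here since it is machine-dependent:

(* Baseline: built-in vectorized sampling of 10^6 normal variates. *)
RepeatedTiming @ Through[ {Mean, StandardDeviation} @ RandomVariate[ NormalDistribution[0., 1.], 1000000 ] ]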
Questions
- What would be the recommended implementation for the `LLVM` version if speed is the main goal, e.g., setting options or using `` Native`Random `` instead of `RandomReal`?
- How can the benefits of listability and parallelization be obtained for a `CompiledCodeFunction`? (A first vectorization attempt is sketched below.)
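For the second question, the closest I have come so far is to move the sampling loop inside the compiled function and return a packed array. This is only a sketch that vectorizes without parallelizing; the name `normaldevLLVMVec` is just for illustration, and it assumes `Table`, `Module`, and `RandomReal` compile as used here:

(* Sketch: compile the whole loop over n draws; returns a packed array. *)
normaldevLLVMVec = FunctionCompile @ Function[
    { Typed[ mu, "Real64" ], Typed[ sig, "Real64" ], Typed[ n, "MachineInteger" ] },
    Table[
        Module[ { u = 1.0, v = 1.0, x = 1.0, y = 1.0, q = 1.0, cond = True },
            While[ cond,
                u = RandomReal[];
                v = 1.7156*(RandomReal[] - 0.5);
                x = u - 0.449871;
                y = Abs[v] + 0.386595;
                q = x*x + y*(0.19600*y - 0.25482*x);
                cond = q > 0.27597 && ( q > 0.27846 || v*v > -4.*Log[u]*u*u )
            ];
            mu + sig*v/u
        ],
        {n}
    ]
];

(* Usage, analogous to the timings above: *)
normaldevLLVMVec[0., 1., 1000000]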
Comments

- […] `FunctionCompile` on this site, e.g. https://mathematica.stackexchange.com/a/270431/1871, https://mathematica.stackexchange.com/a/264714/1871, https://mathematica.stackexchange.com/a/261329/1871. I agree that as an EXPERIMENTAL function it may not be ready for serious work, but perhaps it's time to give it a try. – xzczd Aug 13 '22 at 12:21
- `FunctionCompile` is experimental. Nonetheless, I do not really see where this project is heading. – Henrik Schumacher Aug 13 '22 at 16:10
- […] `FunctionCompile`, I really appreciate the nice demonstration of LibraryLink! (+1) – Silvia Aug 15 '22 at 09:42
- (Regarding the `cf` function in my post mentioned in @xzczd's comment above:) Maybe it is just my illusion or misunderstanding of the underlying mechanism, but it feels like the interface provided by `FunctionCompile` is lighter than the one provided by LibraryLink, which I think is the reason my `cf` is slightly faster than @jacob-akkerboom's solution. (Or is it?) – Silvia Aug 15 '22 at 10:11
- […] a `FunctionCompile` version against this, which has a parallelized `Table` (or similar looping construct) inside. So far, I have had some difficulty building a vectorized version with `FunctionCompile`, and things like `` Parallel`Table `` seem a bit hard to find in the documentation. – gwr Aug 15 '22 at 11:38
- `FunctionCompile` feels to me neither like Mathematica nor like C. I find that a bit weird. – Henrik Schumacher Aug 15 '22 at 12:59
- `Compile` is often convenient, but it has some severe performance problems (e.g., because it does not feature pass by reference) and limitations (only one return value). And I myself had my problems getting used to LibraryLink. IMHO, Szabolcs' "LTemplate" package follows a much better ansatz (on top of LibraryLink). – Henrik Schumacher Aug 15 '22 at 13:02
- […] `libData->MTensor_setReal(tensor, pos, 1)`. There might be several potential reasons why that matters: (i) the C-style array `pos` decays into a pointer, meaning additional indirection. This can be avoided by grabbing the raw pointer once with `libData->MTensor_getReal` and then using normal indexing. – Henrik Schumacher Aug 15 '22 at 13:11
- (ii) `MTensor_setReal` is a function pointer in the pointer to the struct `libData`. So there is at least one additional indirection per function call (although the compiler might be clever enough to prevent this). – Henrik Schumacher Aug 15 '22 at 13:12
- (iii) `libData->MTensor_setReal` incorporates a bounds check to prevent segfaults. Using a raw pointer also deactivates this. – Henrik Schumacher Aug 15 '22 at 13:13
- `FunctionCompile` also has similar things, as I see explicit `` Native`UncheckedBlock `` used in source files, which looks like it triggers different IR being generated. – Silvia Aug 15 '22 at 13:25
- […] `ToRawPointer` (or `` Array`GetData ``). There is a simple experiment for that: call your function `cf2` on a $10000 \times 10000$ array `mat`, but with `cf2[mat, 20000]`. This entails write access to a part of memory where it is unlikely that the process has write privileges. At least my machine reacts with a kernel quit (as it should in such a case). – Henrik Schumacher Aug 15 '22 at 13:37
- `` Native`UncheckedBlock `` does more than that. Although I don't really know anything about LLVM IR, it seems that with vs. without it the emitted IR will be different, for example `%var72 = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %var2, i64 %var2)` vs. `%var72 = mul i64 %var2, %var2`. I'm on my way home right now; I'll try to put up a self-contained example when I get home. I learned some undocumented tools in the new compiler from the Wolfram Summer School. They are super handy, and it's a shame they haven't been documented. – Silvia Aug 15 '22 at 13:55
- […] `` Native`UncheckedBlock `` in this case. One of the things that the compiler option `"-Ofast"` will enforce. – Henrik Schumacher Aug 15 '22 at 15:14
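Following up on the `` Native`UncheckedBlock `` remarks above: the symbol is undocumented, so the following is only a hedged sketch of how it appears to be used (wrapping the compiled body), intended to reproduce the checked-vs-unchecked IR difference Silvia quotes; it may change or break between versions:

(* Assumption based on the comments above (undocumented API): wrapping the
   body in Native`UncheckedBlock appears to drop overflow/bounds checks,
   e.g. a plain mul instead of llvm.smul.with.overflow in the emitted IR. *)
cfChecked   = FunctionCompile @ Function[ Typed[ k, "MachineInteger" ], k*k ];
cfUnchecked = FunctionCompile @ Function[ Typed[ k, "MachineInteger" ], Native`UncheckedBlock[ k*k ] ];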