
Fast generation of random variates as test case

The earlier question (195435) about performance tuning for fast generation of gamma-distributed random variates stirred my curiosity. Below I use slightly simpler code from the 3rd ed. of Numerical Recipes (Normaldev) to test different versions of a numerical routine that generates random variates for a normal distribution.

Different Implementations

The following implementations use the same core body, but differ with regard to the way a function is declared in Mathematica:

  • normaldev is an uncompiled implementation
  • normaldevC uses Compile with CompilationTarget → "C"
  • normaldevCLP uses Compile as above, but adds listability and parallelization
  • normaldevLLVM finally uses the new compiler and FunctionCompile

Code (`body` has been escaped as \`body\`):

{ normaldev, normaldevC, normaldevCLP, normaldevLLVM } = With[
    {
      funcBody = "
                While[ cond,
                    u = RandomReal[];
                    v = 1.7156 * (RandomReal[] - 0.5);
                    x = u - 0.449871;
                    y = Abs[v] + 0.386595;
                    q = x*x + y *(0.19600*y - 0.25482 * x);
                    If[ q > 0.27597 && ( q > 0.27846 || v*v > -4. * Log[u] * u * u ), cond = True, cond = False ]
                ];
                mu + sig * v/u"
    },
    Module[
        { normaldev, normaldevC, normaldevCLP, normaldevLLVM }
        ,
        normaldev = StringTemplate["
        Function[ {mu, sig },
            Module[
                {
                    u = 1.0,
                    v = 1.0,
                    x = 1.0,
                    y = 1.0,
                    q = 1.0,
                    cond = True
                }
                ,
                \`body\`
            ]
        ]"
        ];
        normaldevC = StringTemplate["
        Compile[ { { mu, _Real }, { sig, _Real } },
            Module[
                {
                    u = 1.0,
                    v = 1.0,
                    x = 1.0,
                    y = 1.0,
                    q = 1.0,
                    cond = True
                }
                ,
                \`body\`
            ],
            CompilationTarget -> \"C\"
        ]"
        ];
        normaldevCLP = StringTemplate["
        Compile[ { { mu, _Real }, { sig, _Real } },
            Module[
                {
                    u = 1.0,
                    v = 1.0,
                    x = 1.0,
                    y = 1.0,
                    q = 1.0,
                    cond = True
                }
                ,
                \`body\`
            ],
            CompilationTarget -> \"C\",
            RuntimeAttributes -> {Listable},
            Parallelization -> True
        ]"
        ];
        normaldevLLVM = StringTemplate["
        FunctionCompile[
            Function[
                {
                    Typed[ mu,  \"Real64\" ],
                    Typed[ sig, \"Real64\" ]
                }
                ,
                Module[
                    {
                        Typed[ u,    \"Real64\"  ] = 1.0,
                        Typed[ v,    \"Real64\"  ] = 1.0,
                        Typed[ x,    \"Real64\"  ] = 1.0,
                        Typed[ y,    \"Real64\"  ] = 1.0,
                        Typed[ q,    \"Real64\"  ] = 1.0,
                        Typed[ cond, \"Boolean\" ] = True
                    }
                    ,
                    \`body\`
                ]
            ],
            CompilerOptions -> {
                \"AbortHandling\" -> False,
                \"LLVMOptimization\" -> \"ClangOptimization\"[3]
            }
        ]"
        ];
        Map[ ToExpression @ TemplateApply[ #, <| "body" -> funcBody |>]&, { normaldev, normaldevC, normaldevCLP, normaldevLLVM } ]
    ]
];

Performance evaluation

Using RepeatedTiming we can compare run times for these implementations, which I have sorted below from slowest to fastest.

RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldev[0.,1.], {1000000}] ]
(* {0.535271,{-0.00048065,1.00115}} *)

RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldevLLVM[0., 1.], {1000000}] ]
(* {0.199849,{0.00011759,1.00026}} *)

RepeatedTiming @ Through[ {Mean, StandardDeviation}@ Table[ normaldevC[0.,1.], {1000000}] ]
(* {0.18677,{-0.00244043,0.998699}} *)

With[ { args = Transpose @ ConstantArray[{0.0,1.0},1000000]}, RepeatedTiming @ Through[ {Mean, StandardDeviation}@ ( normaldevCLP @@ args )] ]
(* {0.119437,{0.000838859,0.999507}} *)

Questions

  1. What would be the recommended implementation for the LLVM version if speed is the main goal, e.g., setting options, using Native`Random instead of RandomReal?
  2. How can the benefits of listability and parallelization be obtained for a CompiledCodeFunction (the result of FunctionCompile)?
gwr

1 Answer


In general, the weak spots of interpreted languages are loop constructs and function calls. So I'd suggest pushing the Table into the compiled code as well, so that many random numbers are generated at once.

But the weakness of normaldevCLP lies elsewhere: you first generate a constant array of means and standard deviations and then send all this data to the library function. That is two real numbers of input for each random number you want to generate. You can avoid this by using

normaldevCLP2 = StringTemplate["
        Compile[ { { mu, _Real }, { sig, _Real }, { n, _Integer } },
            Table[
                Module[
                    {
                        u = 1.0,
                        v = 1.0,
                        x = 1.0,
                        y = 1.0,
                        q = 1.0,
                        cond = True
                    },
                    `body`
                ]
            ,{i,1,n}],
            CompilationTarget -> \"C\",
            RuntimeAttributes -> {Listable},
            Parallelization -> True,
            RuntimeOptions -> \"Speed\"
        ]"];

On an 8-core machine, for example, you would invoke the compiled function like this (the Listable attribute threads over the list of eight chunk sizes, so each thread produces n/8 variates):

n = 1000000;
\[Mu] = 0.;
\[Sigma] = 1.;
AdevCLP2 = normaldevCLP2[\[Mu], \[Sigma], ConstantArray[Quotient[n, 8], 8]];

On my machine, this yields a 2.5-fold speedup compared to

AdevCLP = normaldevCLP[ConstantArray[\[Mu], n], ConstantArray[\[Sigma], n]]

"What would be the recommended implementation for the LLVM version if speed is the main goal [...] ?"

I'd recommend not using FunctionCompile at all. IMHO, it is completely ill-designed, in particular when you compare its clunky syntax with the ease of use of numba + numpy in Python. And I have not seen any example yet where FunctionCompile's performance was comparable to Compile or LibraryLink. The only advantage of FunctionCompile seems to be that it allows for many more data types. And, by the way, what the heck is the point of wrapping LLVM's options in an additional abstract layer like

 \"LLVMOptimization\" -> \"ClangOptimization\"[3]

?

This is just inflexible and unmaintainable. Why not simply use the approach of CreateLibrary and send the string(s) of compiler options directly to the compiler, e.g.,

"CompileOptions" -> {
    "-Xpreprocessor -fopenmp -fopenmp-simd",
    "-Ofast",
    "-flto"
  }

?

If you really look for performance, then better use good ol' LibraryLink. Its syntax is awful, but one gets used to it. Here is some example code to generate normally distributed numbers with the C++ standard library, using OpenMP for parallelization:

Needs["CCompilerDriver`"];

Module[{lib, file, name},

name = "normalLibraryLink";

Print["Compiling " <> name <> "..."];

file = Export[FileNameJoin[{$TemporaryDirectory, name <> ".cpp"}],

"#include \"WolframLibrary.h\"

#include <omp.h>
#include <random>

EXTERN_C DLLEXPORT int " <> name <> "(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
{
    // Grab the input arguments.
    mreal mu          = MArgument_getReal(Args[0]);
    mreal sigma       = MArgument_getReal(Args[1]);
    MTensor A         = MArgument_getMTensor(Args[2]);
    mint thread_count = MArgument_getInteger(Args[3]);

// Using OpenMP for parallelization.
#pragma omp parallel num_threads( thread_count )
{
    // Slow, but truly random number generator for seeding.
    std::random_device r;

    // std::random_device generates 32-bit numbers, so we should use two of them to seed a 64-bit random number engine.
    std::seed_seq seed { r(), r() };

    // A fast pseudorandom number generator, seeded by a random number.
    // 64-bit Mersenne Twister by Matsumoto and Nishimura, 2000
    std::mt19937_64 random_engine ( seed );

    std::normal_distribution<mreal> dist (mu, sigma);

    // Gamma distribution is available by
    // std::gamma_distribution<mreal> dist (alpha, beta);
    // See https://cplusplus.com/reference/random/ for further random distributions.

    // Compute the data range [i_begin,i_end[ for this thread.
    const mint n       = libData->MTensor_getDimensions(A)[0];
    const mint thread  = omp_get_thread_num();
    const mint i_begin = (n / thread_count) * (thread  ) + (n % thread_count * (thread  )) / thread_count;
    const mint i_end   = (n / thread_count) * (thread+1) + (n % thread_count * (thread+1)) / thread_count;

    // Get the pointer to the start of shared buffer supplied by the tensor argument.
    mreal * a = libData->MTensor_getRealData(A);

    // Fill the range [i_begin,i_end[ of a with random numbers drawn from the distribution dist.
    for( mint i = i_begin; i < i_end; ++i )
    {
        a[i] = dist( random_engine );
    }
}

// Tell the library that it has to let go A.
libData->MTensor_disown(A);

return LIBRARY_NO_ERROR;

}", "Text" ];

lib = CreateLibrary[{file}, name
   , "TargetDirectory" -> $TemporaryDirectory
   , "ShellOutputFunction" -> Print
   , "CompileOptions" -> {
       "-Wall"
       , "-Wextra"
       , "-Wno-unused-parameter"
       , "-mmacosx-version-min=12.0"
       , "-std=c++11"
       , "-Xpreprocessor -fopenmp -fopenmp-simd"
       , "-Ofast"
       , "-flto"
       , "-mcpu=apple-m1"
       , "-mtune=native"
     }
   , "LinkerOptions" -> {"-lm", "-ldl", "-lomp"}
   , "IncludeDirectories" -> {"/opt/local/include/libomp"}
   , "LibraryDirectories" -> {"/opt/local/lib/libomp"}
   (*, "ShellCommandFunction" -> Print*)
   ];

Print["Compilation done."];

normalLibraryLink = LibraryFunctionLoad[lib, name, 
  {
    Real,                 (*mu*)
    Real,                 (*sigma*)
    {Real, 1, "Shared"},  (*vector A; passed by reference. Caution: A gets modified!*)
    Integer               (*thread_count*)
  },
  "Void"                  (*no return value*)
]

]

You can run it by first letting Mathematica allocate an array:

a = ConstantArray[0., n];

Then you invoke normalLibraryLink to fill it with random numbers, using 8 threads:

normalLibraryLink[0., 1., a, 8]; // RepeatedTiming

0.00177217

The advantage is that the array a is passed as "Shared", which means "by reference". So the library can modify the array directly without any allocations or copy operations. You can fill the array a; do something with the random numbers; and then refill the same array again. On my machine this is six times faster than AdevCLP2 and 130 times faster than AdevLLVM.

The compile options are selected for macOS on an Apple M1 processor, with OpenMP installed via MacPorts. Of course, you have to adapt them appropriately for other platforms.

Henrik Schumacher
  • "And I have not seen any example, yet, where FunctionCompile's performance was comparable to Compile." Well, then you've missed recent examples using FunctionCompile on this site, e.g. https://mathematica.stackexchange.com/a/270431/1871 https://mathematica.stackexchange.com/a/264714/1871 https://mathematica.stackexchange.com/a/261329/1871 I agree as an EXPERIMENTAL function it may not be ready for serious work, but perhaps it's time to give it a try. – xzczd Aug 13 '22 at 12:21
  • Thanks for the links. Really interesting. I know that FunctionCompile is experimental. Nonetheless I do not really see where this project is heading. – Henrik Schumacher Aug 13 '22 at 16:10
  • Though I don't share the same idea about FunctionCompile, I really appreciate the nice demonstration of LibraryLink! (+1) – Silvia Aug 15 '22 at 09:42
  • (continued) I actually had the same thought as OP before, but Simon's post (the one mentioned in @xzczd 's comment) convinced me to give it a try. I think the new compiler is still in a very early, rapidly evolving phase; even its design might change in the future. Also I think the "clunky syntax" will not be a permanent problem -- maybe after the compiler is stable, people (either WRI or our community) could take time to shield that from users through an abstract layer (maybe a C-like DSL?). – Silvia Aug 15 '22 at 09:58
  • (continued) Another observation (see the cf function in my post mentioned in @xzczd 's comment above): maybe it is just my illusion or a misunderstanding of the underlying mechanism, but it feels like the interface provided by FunctionCompile is lighter than the one provided by LibraryLink, which I think is the reason my cf is slightly faster than @jacob-akkerboom 's solution. (Or is it?) – Silvia Aug 15 '22 at 10:11
  • +1 Thank you, Henrik, for a very helpful post. It would be nice to have a comparable way to pit a FunctionCompile version against this, one with a parallelized Table (or similar looping construct) inside. So far, I have had some difficulty building a vectorized version with FunctionCompile, and things like Parallel`Table seem a bit hard to find in the documentation. – gwr Aug 15 '22 at 11:38
  • Hi @Silvia, thank you for your feedback. I have the tendency to be a bit harsh at times. Sorry to anybody who feels offended by that. "Also I think the 'clunky syntax' will not be a permanent problem" Now I see what you mean. I had not thought of that before. Nonetheless, FunctionCompile feels to me neither like Mathematica nor like C. I find that a bit weird. – Henrik Schumacher Aug 15 '22 at 12:59
  • Anyway, I appreciate that Wolfram Research invests resources into a better compilation interface. Compile is often convenient, but it has some severe performance problems (e.g., because it does not feature pass by reference) and limitations (only one return value). And I myself had my problems with getting used to LibraryLink. IMHO, Szabolcs' LTemplate package follows a much better ansatz (on top of LibraryLink). – Henrik Schumacher Aug 15 '22 at 13:02
  • With regard to @jacob-akkerboom 's solution: I think the issue there is the use of libData->MTensor_setReal(tensor, pos, 1). There might be several potential reasons why that matters: (i) The C-style array pos decays into a pointer, meaning additional indirection. This can be avoided by grabbing the raw pointer once with libData->MTensor_getRealData and then using normal indexing. – Henrik Schumacher Aug 15 '22 at 13:11
  • (ii) Looking into WolframLibrary.h, I see that MTensor_setReal is a function pointer in the struct pointed to by libData. So at least one additional indirection per function call (although the compiler might be clever enough to prevent this). – Henrik Schumacher Aug 15 '22 at 13:12
  • (iii) I'd guess that libData->MTensor_setReal incorporates a bounds check to prevent segfaults. Using a raw pointer also deactivates this. – Henrik Schumacher Aug 15 '22 at 13:13
  • Hi Henrik, without doubt you are far more expert than I am regarding high-performance programming and compiling techniques. I have learnt a lot from your posts on this site many times. As an ordinary user who is not as familiar with C as with WL, I'm really holding my breath to see what this brand-new compiler will offer. Currently I think the lack of documentation might be one of the biggest issues. I just hope WRI ships a manual soon! Thank you very much again for having this discussion. – Silvia Aug 15 '22 at 13:19
  • Regarding bounds checks, I think FunctionCompile also has similar things, as I see explicit Native`UncheckedBlock used in source files, which looks like it triggers different IR being generated. – Silvia Aug 15 '22 at 13:25
  • Hm. I think the bounds checks are already deactivated by ToRawPointer (or by Array`GetData). There is a simple experiment for that: call your function cf2 on a $10000 \times 10000$ array mat, but with cf2[mat, 20000]. This entails write access to a part of memory where it is unlikely that the process has write privileges. At least my machine reacts with a kernel quit (as it should in such a case). – Henrik Schumacher Aug 15 '22 at 13:37
  • I see what you meant. I think Native`UncheckedBlock does more than that. Although I don't really know anything about LLVM IR, it seems that with vs. without it, the IR emitted is different, for example %var72 = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %var2, i64 %var2) vs. %var72 = mul i64 %var2, %var2. I'm on the way home right now. I'll try to put up a self-contained example when I get home. I learnt some undocumented tools in the new compiler from Wolfram Summer School. They are super handy and it's a shame they haven't been documented. – Silvia Aug 15 '22 at 13:55
  • That looks as if integer overflow checks get deactivated by Native`UncheckedBlock in this case. One of the things that the compiler option "-Ofast" will enforce. – Henrik Schumacher Aug 15 '22 at 15:14