5

According to the documentation of CUDAFunctionLoad it should be easy to specify a compiled file (cubin, ptx, dll should all work) as the source for loading a CUDAFunction. Unfortunately it does not for me. Compiling from source works fine, but as soon as I try to load the CUDAFunction from the compiled file (I tried cubin, ptx and a dll) things fail.

Here is a very simple example that does not work for me, no matter what combination I try (you can just copy and paste the code into mathematica and run it as long as you have a proper CUDA setup):

Let's create a cubin file first from a very simple CUDA kernel:

Needs["CUDALink`"];
code = "
  __global__ void addTwo(int * in, int * out, int length) {
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index < length)
        out[index] = in[index] + 2;
  }";
cubinFile = CreateExecutable[code, "test", "Compiler" -> NVCCCompiler, 
   "CreateCUBIN" -> True];

This successfully creates test.cubin.

Unfortunately loading the function addTwo fails:

cudaFun = CUDAFunctionLoad[File[cubinFile], 
   "addTwo", {{"Integer32", _, "Input"}, {"Integer32", _, 
     "Output"}, "Integer32"}, 256, "ShellCommandFunction" :> Print, 
   "ShellOutputFunction" -> Print];

CUDAFunctionLoad::invsrc: CUDALink encountered invalid source input. The source input must be either a string containing the program, or a list of one element indicating the file containing the program.

The input file should be valid, but maybe I am missing something obvious here. Interestingly enough going the same route creating a .ptx file yields a different error:

ptxFile = 
  CreateExecutable[code, "test", "Compiler" -> NVCCCompiler, 
   "CreatePTX" -> True];
cudaFun = 
  CUDAFunctionLoad[File[ptxFile], 
   "addTwo", {{"Integer32", _, "Input"}, {"Integer32", _, 
     "Output"}, "Integer32"}, 256, "ShellCommandFunction" :> Print, 
   "ShellOutputFunction" -> Print];

CUDAFunctionLoad::notfnd: CUDALink resource not found.

Any ideas on what is going wrong here and how I can actually load a CUDAFunction from a compiled file?

You can just copy and paste the code above into mathematica and run it as long as you have CUDA setup properly. Can you reproduce the behavior?


Additional Information:
I am running Mathematica 11.2 on Windows 10.
Latest CUDA Paclet installed. CUDA setup working fine (can load CUDAFunctions from source, compile files, etc.).

Wizard
  • 2,720
  • 17
  • 28
  • Have you tried to wrap the filename by braces ({})? The first error message ask for a list containing a single element, namely the filename. Admittedly, that's rather unconventional in Mathematica but this can shield a string containing the filename from interpretation as a code string... – Henrik Schumacher Feb 15 '18 at 23:55
  • @HenrikSchumacher: Yes, I did. Same error message. – Wizard Feb 16 '18 at 13:25
  • Okay, that was just a shot into the dark for me. My graphics card not supporting CUDA, I cannot help you any further... Sorry. – Henrik Schumacher Feb 16 '18 at 13:36
  • @HenrikSchumacher: I get your reasoning and I had the same thought when reading the error message. Thanks for your efforts. – Wizard Feb 16 '18 at 13:52
  • 3
    The error message is misleading. It appears because a function fails in a very deep level of the code. The error is handled using Throw[$Failed] which is then caught by the part that creates the misleading error. The real error in the first case can be examined by evaluating the CUDAFunctionLoad line, then calling <<LibraryLink`, and then evaluating $LibraryError. For me this gives ".../libCUDALink_Double.so: undefined symbol: WolframCompileLibrary_wrapper". I spend a good 2 h debugging the CUDALink package and when I hear back from support@wri, I give a more detailed answer. – halirutan Feb 25 '18 at 01:32
  • 1
    @halirutan: You are one of the heroes of these forums! Thank you so much for all your efforts. I am truly amazed by your commitment and dedication to the community. – Wizard Feb 25 '18 at 10:11
  • "Our engineer most familiar with CUDA programming is out for the week. He will be most likely to provide assistance on this issue. I have asked that he look at this as soon as he returns. I apologize for the delay here, but if there is a potential solution/workaround for what you're seeing he will be most likely to find it." I guess the answer will arrive too late to grab your bounty and my findings will only tell you why it doesn't work and why I believe it is a bug. – halirutan Feb 27 '18 at 09:22
  • @hailrutan: Thank you very much for the follow up. I would appreciate if you would leave more information here, if you hear back from Wolfram support once again. I would say there is at least hope that this issue gets fixed in a future mathematica update. I also do hope they put more emphasis on cuda development for their future releases. – Wizard Feb 27 '18 at 12:59

1 Answers1

5

Here we go: Upfront, this was a pain as the CUDALink` package uses GPUTools`, LibraryLink` and since we are compiling on our own, CCompilerDriver` played its own role. Therefore, it was a lot of code to look at but I believe I found the reason.

TL;DR Okay, okay... I believe the main mistake was that by compiling the code with CreateExecutable does name-mangling which prevented Mathematica from successfully using the compiled kernel. Therefore, a minimal working example on my machine is the following:

<< CUDALink`

code = "
  __global__ void addTwo(int * in, int * out, int length) {
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index < length)
        out[index] = in[index] + 2;
  }";
cubin = CreateExecutable[code,
  "AddTwo",
  "Compiler" -> NVCCCompiler,
  "CUDAArchitecture" -> {"sm_61"},
  "CreateBinary" -> True,
  "CreateCUBIN" -> True,
  "UnmangleCode" -> True]
CUDAFunctionLoad[File[cubin], "addTwo", {{"Integer32", _, "Input"},
  {"Integer32", _, "Output"}, "Integer32"}, 256]

Steps to investigate

First, I needed to find out where on earth this error was coming from. I was stepwise following the code and in my case, the problem was in

CUDALink`Private`SetKernel["addTwo"]

If you look closer, then this calls a function

CUDALink`Private`cSetKernel["Double"]["addTwo"]

that returned a

LibraryFunctionError["LIBRARY_FUNCTION_ERROR", 6]

On the other hand, if you used CUDAFunctionLoad[code, ...] this call succeeded. This was hard to find, because above function simply Throw[$Failed] and the returned message what this exception was caught was not very helpful.

Now I started tracing back why this was failing. For this I mainly used PrintDefinitions from the <<GeneralUtilities` package because it lets me instantly change the code of the built-in functions and I could throw in some Print calls.

I knew that the CUDAFunctionLoad had to compile the kernel when it only gets code. When you look at the InputForm of a CUDAFunction you see the compiler switches it uses. I copied this exactly but still no luck. Then I got the idea to look at the md5 sum of the created cubin files: The one I create and the one that was automatically created by CUDAFunctionLoad. This was as different as their sizes were. That was odd.

After having a nice dinner and letting this settle for a bit, it hit me: How on earth can I be sure that the compiler call is the only thing that has to match. There must be a place where CUDAFunctionLoad calls itself CreateExectuable.

Solution

If my above example does not work for you, but you can successfully use CUDAFunctionLoad to compile a CUDA Kernel, or if you are just curious, then open the code for CUDALink`Private`iCompileCUDAKernel like this

<< GeneralUtilities`
PrintDefinitions[CUDALink`Private`iCompileCUDAKernel]

Near the end of the first long definition, you will find a Block:

Block[{CCompilerDriver`$ErrorMessageHead = errMsgHd},
        fileName = Check[
                Quiet[
                    CCompilerDriver`CreateExecutable[prog, fileName, Sequence@@createLibraryOpts],
                    CUDALink`Private`errMsgHd::wddirty
                ],
                $Failed
            ]
    ];

Before the call to CreateExecutable, just throw in a

Print[{prog, fileName, Sequence @@ createLibraryOpts}]

You need to do this on a fresh kernel when you have already compiled your CUDA kernel with CUDAFunctionLoad because things are cached. Evaluate the cell with Shift+Enter and then call

CUDAFunctionLoad[code, ...]

You will see the exact arguments to CreateExecutable that CUDALink uses internally. There are a lot more that I have given above, but the rest weren't necesarry in my case. If you compile your kernel with these options, you should be able to load the cubin.

I will send the support@WRI this information and ask them if they also feel that this should be documented somewhere.

halirutan
  • 112,764
  • 7
  • 263
  • 474
  • Thanks for the detailed reply. I tried your solution and it works. Interestingly enough I really had to manually specify "CUDAArchitecture" -> {"sm_61"}, as you did in your example above (it happens to conform to my Geforce GTX 1070). – Wizard Mar 02 '18 at 10:36