GPU training in Mathematica seems to have additional software requirements. For example, when I run the following on a Linux system, I get the error shown below:

trainingData = 
  RandomReal[1, {10000, 4}] -> RandomReal[1, {10000, 4}];
net = NetChain[{8, 4}];
NetTrain[net, trainingData, TargetDevice -> "GPU"]

[13:35:27] /home/dszeto/Desktop/mxnet0932_64/dmlc-core/include/dmlc/./logging.h:300: [13:35:27] /home/dszeto/Desktop/mxnet0932_64/src/storage/storage.cc:38: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: CUDA driver version is insufficient for CUDA runtime version

The system has a K40 GPU with CUDA 7.5 installed and has no problem running the GPU version of TensorFlow. Here is the detailed system information:

$ uname -m && cat /etc/*release
x86_64
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Red Hat Enterprise Linux Server release 6.8 (Santiago)
Red Hat Enterprise Linux Server release 6.8 (Santiago)
$ gcc --version
gcc (GCC) 4.9.0
$ uname -r
2.6.32-642.11.1.el6.x86_64
In[1]:= $Version
Out[1]= 11.1.0 for Linux x86 (64-bit) (March 13, 2017)

From the error message, it seems that Mathematica is using a CUDA runtime library that is newer than the driver on the system supports. However, the CUDA library installed on the system must be compatible with the GPU driver, since GPU training in TensorFlow works well. Does Mathematica use its own CUDA library rather than the one on the system? What are the software requirements for components such as:

  1. GCC
  2. graphics driver
  3. cuDNN
  4. CUDA
  5. MXNet

Mathematica appears to ship with its own MXNet library (in the folder SystemFiles/Components/MXNetLink/LibraryResources/), but what about the others?

Moreover, what is the dependency hierarchy of the software components that the Mathematica neural network framework builds on?

xslittlegrass
  • The error says "CUDA driver version is insufficient for CUDA runtime version", so start by updating your drivers to the latest version. Check directly on Nvidia's site, because the OS update check sometimes does not offer the latest version. Also check with CUDAInformation that your GPU has compute capability 3 or higher. – Edmund Apr 17 '17 at 21:12
  • @Edmund I'm on a GPU cluster with hundreds of nodes, and updating the driver is a huge effort for the administrators (it involves resolving potential conflicts and rebooting the cluster). I want to be sure of the problem first. I don't want them to update the driver only to find later that something else (for instance cuDNN or CUDA) also needs to be updated. A cluster environment is sometimes very different from a desktop environment, so I want to be able to give them a list of package requirements before asking them to update the system. – xslittlegrass Apr 17 '17 at 21:22
  • Are you able to execute CUDAInformation and CUDADriverVersion to at least see whether the GPUs have "Compute Capabilities" >= 3 and how old the driver running on them is? If they don't meet the "Compute Capabilities" requirement, then the hardware can't support it. – Edmund Apr 17 '17 at 21:33
  • Nvidia's website says the Tesla K40 has Compute Capability 3.5 so good news that the hardware is capable. Check the driver version. – Edmund Apr 17 '17 at 21:38
  • @Edmund The GPU works well in TensorFlow, which means the CUDA library and the GPU driver on the system are compatible with each other. I'm wondering whether Mathematica uses its own CUDA library instead of the one on the system. – xslittlegrass Apr 17 '17 at 21:40
  • I'm guessing from this that Mathematica comes with its own CUDA libraries for the neural net functionality: http://community.wolfram.com/groups/-/m/t/917616 – dan7geo Apr 18 '17 at 10:38
  • @xslittlegrass I believe it does. It sounds like Mathematica may be trying to use the wrong CUDA version. I've asked other people in the company to take a look at this thread and comment. – Taliesin Beynon Apr 18 '17 at 12:17
  • @TaliesinBeynon Any news on this? Should I contact the person directly by email, or go through technical support? My email is 1:eJxTTMoPChZnYGCoKM7JLCnJSU0vSiwudkjPTczM0UvOzwUApVgK/w== – xslittlegrass Apr 20 '17 at 20:22
  • @sebastian mind commenting here? – Taliesin Beynon Apr 22 '17 at 15:27
  • @xslittlegrass: this should definitely work, as we ship the CUDA drivers + appropriate libraries. Can you try on 11.1.1 and see whether it's still broken? (We rebuilt the libraries for 11.1.1.) If it is, then I will contact you next week to try and resolve this together (if you have time). It's hard to debug this without a machine where this is failing... – Sebastian Apr 22 '17 at 16:55
  • @Sebastian I'd love to work with you to resolve this issue. Where can I download 11.1.1? I only have 11.1.0 in my user portal. – xslittlegrass Apr 22 '17 at 18:55
  • @Sebastian I tried with 11.1.1 and it is still broken. Here is the detail of the error message. – xslittlegrass Apr 26 '17 at 05:26
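
The driver check suggested in the comments can be sketched as a small shell script that compares the installed NVIDIA driver version against the minimum a given CUDA runtime needs. This is only a sketch under stated assumptions: the function names are hypothetical, the minimum-driver values are quoted from memory of NVIDIA's Linux release notes and should be verified there, and the thread does not say which CUDA runtime Mathematica's bundled libraries were built against, so the `8.0` in the usage comment is a placeholder, not a confirmed value.

```shell
#!/usr/bin/env bash

# Minimum Linux driver version per CUDA toolkit release.
# ASSUMPTION: values recalled from NVIDIA's release notes; verify before relying on them.
min_driver_for_cuda() {
  case "$1" in
    7.0) echo 346.46 ;;
    7.5) echo 352.31 ;;
    8.0) echo 367.48 ;;
    *)   return 1 ;;   # toolkit version not in this small table
  esac
}

# version_ge A B: succeeds when dotted version A is >= B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# driver_supports_cuda DRIVER CUDA: prints "ok", "too old (need >= X)", or "unknown".
driver_supports_cuda() {
  local driver="$1" cuda="$2" min
  min="$(min_driver_for_cuda "$cuda")" || { echo "unknown"; return 2; }
  if version_ge "$driver" "$min"; then
    echo "ok"
  else
    echo "too old (need >= $min)"
    return 1
  fi
}

# Usage on a node with the NVIDIA driver installed (8.0 is a placeholder runtime version):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   driver_supports_cuda "$driver" 8.0
```

If a driver passes for CUDA 7.5 but fails for a newer runtime, that would be consistent with TensorFlow working against the system CUDA while a newer bundled runtime reports "driver version is insufficient"; it would not by itself confirm that this is what happens here.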

0 Answers