CUDADot runs slowly and writes "CUDALink experienced a kernel launch failure" with matrix size > 3300

Question

I am trying to learn CUDA, starting from the help examples. It turns out that the CPU calculation is much faster than CUDA for matrix sizes < 3200, while CUDADot gives the error for sizes > 3300. I have Windows 10 64-bit, an i7-4702MQ with 8 GB RAM, CUDADriver 373.06 (the most recent as of today), and a GeForce GT 750M with the following CUDAInformation[]:

{1->{Name->GeForce GT 750M,Clock Rate->1085000,Compute Capabilities->3.,GPU Overlap->1,
Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{2147483647,65535,65535},
Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,
Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,    
Maximum Registers Per Block->65536,Texture Alignment->512,Multiprocessor Count->2,
Core Count->64,Execution Timeout->1,Integrated->False,Can Map Host Memory->True,
Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65536,
Texture3D Width->4096,Texture3D Height->4096,Texture3D Depth->4096,
Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,
Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->False,TCC Enabled->False,
Total Memory->4294967296}}

For matrix size 3200 I get

randM = RandomReal[1, {3200, 3200}]; AbsoluteTiming[randM.randM;]
{0.857097,Null}

AbsoluteTiming[CUDADot[randM, randM];]
{2.56034,Null}

randMG = CUDAMemoryLoad[randM]

CUDAMemory(23176,Type->Double,Dimensions->{3200,3200},ByteCount->81920000,
Residence->DeviceHost,Sharing->Manual,Unique->True,Platform->1,Device->1,
MathematicaType->List,TypeInfromation->{})

AbsoluteTiming[res = CUDADot[randMG, randMG]]

{2.09685,CUDAMemory(10547,Type->Double,Dimensions->{3200,3200},ByteCount->81920000,
Residence->DeviceHost,Sharing->Shared,Unique->True,Platform->1,Device->1,
MathematicaType->List,TypeInfromation->{})}

For matrix size 3300 I get

randM = RandomReal[1, {3300, 3300}]; AbsoluteTiming[randM.randM;]
{0.850639,Null}

AbsoluteTiming[CUDADot[randM, randM];]
CUDADot::lnchfld: CUDALink experienced a kernel launch failure. >>
{2.64628,Null}

The same error occurs when the matrix is first loaded with CUDAMemoryLoad[]. This behaviour is present in both M9 and M10.4.

Thank you for any help to improve this.

I tested it on MM 11.0 with my NVIDIA K4200. It has comparable data: — Eisbär, Oct 10 '16 at 08:09
I can easily use matrices of size 4200x4200 without issues. My laptop already struggles with the 3200x3200 array, giving an "undefined" error. Does your error also occur if you go just over 3200x3200 (3201x3201)? — Eisbär, Oct 10 '16 at 08:36
Further investigation of the issues with the laptop revealed that the calculation simply overheats the graphics card :-o. So it is definitely not related to Mathematica. — Eisbär, Oct 10 '16 at 14:41
