I am trying to learn CUDA, starting from the help examples. It turns out that CPU calculation is much faster than CUDA for matrix sizes up to 3200, while CUDADot gives an error for sizes of 3300 and above. I have Windows 10 64-bit, an i7-4702MQ with 8 GB, CUDA driver 373.06 (the most recent as of today), and a GeForce GT 750M with the following CUDAInformation[]:
{1->{Name->GeForce GT 750M,Clock Rate->1085000,Compute Capabilities->3.,GPU Overlap->1,
Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{2147483647,65535,65535},
Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,
Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,
Maximum Registers Per Block->65536,Texture Alignment->512,Multiprocessor Count->2,
Core Count->64,Execution Timeout->1,Integrated->False,Can Map Host Memory->True,
Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65536,
Texture3D Width->4096,Texture3D Height->4096,Texture3D Depth->4096,
Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,
Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->False,TCC Enabled->False,
Total Memory->4294967296}}
For matrix size 3200 I get:
randM = RandomReal[1, {3200, 3200}]; AbsoluteTiming[randM.randM;]
{0.857097,Null}
AbsoluteTiming[CUDADot[randM, randM];]
{2.56034,Null}
randMG = CUDAMemoryLoad[randM]
CUDAMemory(23176,Type->Double,Dimensions->{3200,3200},ByteCount->81920000,
Residence->DeviceHost,Sharing->Manual,Unique->True,Platform->1,Device->1,
MathematicaType->List,TypeInfromation->{})
AbsoluteTiming[res = CUDADot[randMG, randMG]]
{2.09685,CUDAMemory(10547,Type->Double,Dimensions->{3200,3200},ByteCount->81920000,
Residence->DeviceHost,Sharing->Shared,Unique->True,Platform->1,Device->1,
MathematicaType->List,TypeInfromation->{})}
For matrix size 3300:
randM = RandomReal[1, {3300, 3300}]; AbsoluteTiming[randM.randM;]
{0.850639,Null}
AbsoluteTiming[CUDADot[randM, randM];]
CUDADot::lnchfld: CUDALink experienced a kernel launch failure. >>
{2.64628,Null}
The same error occurs when the matrix is first loaded with CUDAMemoryLoad[]. This behaviour is present in both Mathematica 9 and Mathematica 10.4.
Thank you for any help in improving this.