
Edit

Thanks to all who have responded! I now have plenty of information on the GeForce 650M GPU found in most new, mid-range Mac laptops; the Mandelbrot code below runs in comfortably under a tenth of a second. I'm still quite curious how that compares to the GeForce 675MX GPU, which comes in new, high-end iMacs. If someone provides that information as an answer, I'll happily accept!


Original

When V8 came out, I was very excited about the introduction of GPU support. So far, however, I've not really been able to take advantage of it the way I would like to, simply because I haven't had a computer with an adequate GPU. Even when CUDA and/or OpenCL are supported, one might not see significant gains when using the GPU due (I think) to limited block size.

Well, now it appears that I'll be getting a new computer in the not-so-distant future and I'd like to make sure that I'm happy with the GPU. To that end, I wonder if I could get some feedback on actual user experiences. I am specifically interested in how CUDA and OpenCL work with mid-range Mac laptops and desktops. Laptops seem to run NVIDIA GeForce 640M or 650M GPUs, while desktops are more flexible, going up to the 675MX or higher in iMacs.

So, does CUDA run on these machines? Specifically, does at least the following return True:

(* Warning - a large package is loaded from Wolfram Research *)
Needs["CUDALink`"];
CUDAQ[]

(* Out: False *)

I am aware, of course, of the system requirements documentation here: http://reference.wolfram.com/mathematica/CUDALink/tutorial/Reference.html#1803279895

That page indicates that CUDA does not run on 600-series GeForce GPUs, but I have a hard time believing that CUDA still isn't running on mainline Macs at this point. Furthermore, NVIDIA's own list of CUDA-enabled GPUs seems to conflict with this: https://developer.nvidia.com/cuda-gpus

Hence, the question.
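
Incidentally, if CUDAQ[] does come back False, CUDALink has a couple of query functions that may help narrow down whether the problem is the driver or the hardware. This is just a diagnostic sketch, not part of the benchmark below:

Needs["CUDALink`"];
CUDADriverVersion[]   (* version of the installed NVIDIA driver, if any *)
CUDAInformation[]     (* details for each CUDA device that was detected *)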


Also, I wonder if folks wouldn't mind trying a little test - say, generating a Mandelbrot set - since I'm quite curious about the relative speed of GPUs as accessed through Mathematica. To that end, here's a simple OpenCL program that computes escape-time counts for a Mandelbrot image. Note that the blockSize parameter can be changed. I'm not an expert on GPU programming, but I believe that higher-end GPUs generally allow larger block sizes and that larger block sizes permit more parallelization. My computer allows a blockSize of 16 or lower; lowering the blockSize generally slows down the computation.

Needs["OpenCLLink`"];
blockSize = 16;
code = "
  __kernel void mandel_kernel(__global Real_t *mSet, int xRes, int yRes, 
     Real_t xMin, Real_t xMax, Real_t yMin, Real_t yMax) {
     int xIndex = get_global_id(0);
     int yIndex = get_global_id(1);
     int i;

     Real_t cx = xMin + xIndex*(xMax-xMin)/xRes;
     Real_t cy = yMin + yIndex*(yMax-yMin)/yRes;
     Real_t x = cx;
     Real_t y = cy;
     Real_t tmp;

     if (xIndex < xRes && yIndex < yRes) {
         /* standard escape-time iteration: z -> z*z + c */
         for (i = 0; i < MAX_ITERATIONS && x*x + y*y <= BOUND_SQUARED; i++) {
            tmp = x*x - y*y + cx;
            y = 2*x*y + cy;
            x = tmp;
        }
        /* note: this flat index assumes a square grid (xRes == yRes) */
        mSet[xIndex + yIndex*yRes] = i;
      }
  }
  ";
If[OpenCLQ[] === True,
  mandelCalculate = OpenCLFunctionLoad[code, "mandel_kernel", {{_Real, _, "Output"}, 
    _Integer, _Integer, _Real, _Real, _Real, _Real}, {blockSize, blockSize}, 
    "Defines" -> {"MAX_ITERATIONS" -> 100, "BOUND_SQUARED" -> "4.0"}],
  Print["I'm sorry, your computer is even lamer than Mark's!"]
];

Assuming your computer actually passes the test, the following will run the program on an xRes by yRes grid.

xRes = 1500; yRes = 1500;
(mSet = OpenCLMemoryAllocate[Real, {xRes, yRes}];
 mandelCalculate[mSet, xRes, yRes, -2.0, 0.6, -1.3, 1.3];
 data = OpenCLMemoryGet[mSet]); // AbsoluteTiming

(* Out: {0.065236, Null} *)

Yeah, that's pretty fast. The computation was performed on the following GPU.

"Renderer" /. ("OnScreen" /. ("OpenGL" /. 
 SystemInformation["Devices", "GraphicsDevices"]))

(* Out: "ATI Radeon HD 6750M OpenGL Engine" *)

Again, the point is to compare GPUs but, if you want to generate an image, here's one way to do so:

colors = Map[{(100 - #)^2/10000, (100 - #)^3/1000000, (100 - #)/100} &, data, {2}];
Image[colors]

(* Out: Groovy picture *)
Mark McClure
  • "Specifically, does at least the following return True" — For a lot of folks, it might not be as simple as that (esp. macs). I know for a fact that on both my MBPs (late 2008 and retina), CUDAQ[] repeatedly gave False even after installing the necessary packages from WRI. In the case of my old laptop, I had to manually install an update from NVIDIA, and in my retina MBP, I had to force the graphics card to always use the NVIDIA card and not the integrated one before I could get them to work (return True). – rm -rf Feb 26 '13 at 22:48
  • Btw, I get "OpenCLFunction::invblksz: OpenCLLink block size is invalid. " when I run the second code block starting with xRes=... – rm -rf Feb 26 '13 at 22:54
  • Slightly related : http://stackoverflow.com/questions/8638905/would-a-better-graphics-card-or-more-cores-make-mathematica-faster/8639021#8639021 – Artes Feb 26 '13 at 23:07
  • @rm-rf Per the installation comment - I had a similar experience and needed to install separate software from NVIDIA when trying to get CUDAQ to return True when this stuff first came out. My impression is that this is the reason that Wolfram's data paclets are accessed when first loading CUDALink. – Mark McClure Feb 26 '13 at 23:08
  • @rm-rf Per the block size: If you get the invalid block size message, you might try lowering the blockSize parameter in the code. – Mark McClure Feb 26 '13 at 23:09
  • Ok, block size 1 worked for me. And boy, that picture is pretty! Mine clocked at 0.0657 on NVIDIA GeForce GT 650M OpenGL Engine – rm -rf Feb 26 '13 at 23:14
  • btw, is that C code? – rm -rf Feb 26 '13 at 23:14
  • @rm-rf It's essentially C accessing the OpenCL API. The picture is so nice because of the high resolution. And thanks for the info! – Mark McClure Feb 26 '13 at 23:18
  • @MarkMcClure It seems that Mathematica 9 always forces using the NVIDIA GPU on a Mac laptop. Setting up CUDA was pretty straightforward: I installed the package from here, and after this CUDAQ[] returned True. This is with GeForce GT 650M – Szabolcs Feb 26 '13 at 23:19
  • How can I find out what the valid block sizes are, and whether OpenCLLink actually uses the GPU? It only accepted a block size of 1. – Szabolcs Feb 26 '13 at 23:22
  • @Szabolcs: That has not been my experience on Ubuntu 12.04. The CUDA stuff works (ie all the NVIDIA examples, but CUDALink .... Not. – dwa Feb 26 '13 at 23:23
  • @rm-rf and Mark -- it seems that OpenCLLink uses the CPU by default on an rMBP, not the GPU. Check SystemInformation[] under Links: Device 1 is the CPU and $OpenCLDevice is set to 1. I'm not entirely sure how to change that (probably $OpenCLDevice = $OpenCLPlatform = 2, but I don't really see any difference) – Szabolcs Feb 26 '13 at 23:29
  • @dwa My comment referred specifically to Mac laptops with NVIDIA GPUs, running OS X. – Szabolcs Feb 26 '13 at 23:30
  • @rm-rf It seems OpenCL always keeps reverting to running on the CPU only. $OpenCLPlatform reverts to 1 here. The docs say this variable has to be set before you load the opencl function, but at that point it resets to 1. – Szabolcs Feb 26 '13 at 23:36
  • @Szabolcs I don't understand what you mean by "Mathematica 9 always forces using the NVIDIA GPU on a Mac laptop". I'm at home now too with my even lamer laptop, as opposed to my office desktop, so I'll really have to wait to experiment more until tomorrow. – Mark McClure Feb 27 '13 at 00:35
  • Thanks for all the info, everyone! I wonder if anyone has a GeForce 675?? – Mark McClure Feb 27 '13 at 00:35
  • @MarkMcClure The laptop I'm using has two GPUs: a low-power Intel one and an NVIDIA one. It automatically switches between the two to balance performance and battery life. rm-rf mentioned that on such a laptop you need to force using the NVIDIA GPU (change energy saving preferences), otherwise CUDALink won't work. This is true for Mathematica 8, but it is not necessary for Mathematica 9 because 9 always triggers a switch to the NVIDIA GPU on startup (which is rather unfortunate because that's not needed most of the time and reduces battery life ...) – Szabolcs Feb 27 '13 at 00:40
  • @Szabolcs Got it - thanks! – Mark McClure Feb 27 '13 at 00:46
  • My 3+ year old MacPro desktop with ATI Radeon HD 4870 runs your code in 0.077 sec. My 2012 MacBook Air with Intel HD Graphics 4000 runs it in 0.110 sec. Both are running MMA 9. – Cassini Feb 27 '13 at 01:25
  • Ok, same here now - I can run it errorless with blockSize = 1 in 0.07344 sec. No luck trying to change $OpenCLPlatform either. – gpap Feb 27 '13 at 11:34
  • @rm-rf I hope you don't find my edit to be of the Lena topping on the transparent box variety! – Mark McClure Feb 27 '13 at 14:26
  • @Mark be careful about the test results people are mentioning in comments. I'm pretty sure that on many Macs your OpenCL code will be run on the CPU, not the GPU. The timings will not be representative of the GPU. At least this is the case on my laptop (retina macbook, GeForce GT650M). – Szabolcs Feb 27 '13 at 14:28
  • @Szabolcs Really? I had no idea! I honestly didn't even think it could run on the CPU. It certainly couldn't run on the CPU in under a second, could it? Let alone in under a tenth of a second. – Mark McClure Feb 27 '13 at 14:31
  • @Mark OpenCL is quite general, it allows running on the GPU or (multiple) CPUs. On my machine OpenCLInformation[] returns one "platform" ("Apple") and two "devices". The first device is the CPU, the second the GPU. It should be possible to select the default one using $OpenCLDevice and $OpenCLPlatform, but for some reason the device always reverts to the CPU when running your code. CUDA on the other hand is only for (NVIDIA) GPUs, it doesn't support running on CPUs. – Szabolcs Feb 27 '13 at 14:38
  • @Szabolcs Well, I'm pretty baffled! Obviously, I'm just learning about GPU programming, which should prove very useful to me, if I can master it. Ultimately, though, my immediate decision is between a new Macbook Pro or a new Mac Desktop and the question is "are the speed advantages of the desktop worth the immobility?" The answer would likely be "yes", if we're talking about close to 10 times as fast. – Mark McClure Feb 27 '13 at 14:50
  • @MarkMcClure bear in mind that GPUs are, in practice, usually not the incredible powerhouses they're often made out to be. A top-end GPU has about a five- to ten-fold theoretical advantage over a top-end CPU, given well-tuned code running on both. However, optimizing for GPUs is difficult, and a CPU will tolerate less well tuned code much better than a GPU will. Furthermore, a mid-range GPU is generally proportionally much worse than a top-end model, whereas the difference between CPUs is less pronounced. So, don't be surprised if a CPU can often run your OpenCL code as fast as a GPU can. – Oleksandr R. Feb 27 '13 at 15:08
  • @OleksandrR. Right - my understanding is that GPUs are (except for the very highest end) not particularly fast and run only in single precision. Their advantage lies solely in the very large number of parallel processors they typically contain. Thus, algorithms that are easily and highly parallelizable tend to benefit, while many others don't. It just so happens that images arising in complex dynamics (an area of interest for me) fall into this area. – Mark McClure Feb 27 '13 at 15:21
  • @MarkMcClure CUDA does run on the 650M, I just ran it this morning. Specifically, you need to download the driver. – rcollyer Feb 27 '13 at 15:36

1 Answer


I tested your code on my NVIDIA K6000. For some reason, the compiler insisted that I enable double-precision by inserting

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

at the beginning of the kernel code (inside the code string itself, without any extra quotation marks).
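
One way to do that - just a sketch, reusing the code variable from the question - is to prepend the pragma to the kernel string before calling OpenCLFunctionLoad:

code = "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n" <> code;

For your values xRes = 1500 and yRes = 1500, here are my AbsoluteTiming results: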

block_size = 32     {0.031003, Null}

block_size = 16     {0.039004, Null}

block_size =  8     {0.031003, Null}  (a repeat of the block_size = 32 timing value!)

block_size =  4     {0.040004, Null}

block_size =  2     {0.051005, Null}

These timings are not that much faster than yours.

I made the problem one-hundred times bigger, by setting xRes = 15000 and yRes = 15000, and got the following timing results:

block_size = 32     {3.360672, Null}

block_size = 16     {3.364673, Null}

block_size =  8     {3.311662, Null}

block_size =  4     {3.649730, Null}

block_size =  2     {4.830966, Null}

and a second run gave

{4.538908, Null}

To be able to repeat the OpenCL function evaluation for xRes = 15000 and yRes = 15000 many times, I had to evaluate

OpenCLMemoryUnload[mSet]

after each OpenCL function evaluation in order to recover the large chunk of memory that one evaluation tied up. The time for an evaluation seems to be more or less proportional to the size of the calculation. The largest block_size = 32 seems to give the fastest calculation, but I was surprised by the relative insensitivity of the calculation time to block_size.
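
For reference, here is a minimal sketch of the repeat loop just described, using the variable names from the question (the number of repetitions is arbitrary):

(* time the kernel several times, freeing the device buffer after each run *)
Do[
 mSet = OpenCLMemoryAllocate[Real, {xRes, yRes}];
 Print[AbsoluteTiming[
    mandelCalculate[mSet, xRes, yRes, -2.0, 0.6, -1.3, 1.3];
    data = OpenCLMemoryGet[mSet];][[1]]];
 OpenCLMemoryUnload[mSet],
 {3}]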

The final graphic is indeed groovy when it renders in 2-3 seconds for the xRes = 1500 and yRes = 1500 case, but it's not so groovy to wait 2.5 minutes for the xRes = 15000 and yRes = 15000 case to render. The big cases do eventually get rendered, but of course they look essentially the same as the smaller ones.
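
If you just want a quick look at a big case without the long render, one option (purely a sketch, reusing the colors formula from the question) is to downsample the escape-time array by plain part extraction before building the image:

(* keep every 10th row and column of the large result *)
small = data[[;; ;; 10, ;; ;; 10]];
colors = Map[{(100 - #)^2/10000, (100 - #)^3/1000000, (100 - #)/100} &, small, {2}];
Image[colors]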

Thanks for posting. I had never run an OpenCL program before. I've only used CUDA.

user15996