
Update: A Mathematica wrapper around cuLM (https://github.com/zitmen/cuLM) should allow us to run the Levenberg-Marquardt algorithm directly in CUDA for nonlinear least-squares fitting.

In a previous question of mine (here) I asked how one could best use Mathematica's model fitting capabilities to fit a 2D Gaussian to a set of data.
Two very nice answers were provided by Rahul Narain (who directly computed a mean and covariance matrix for my example data and then used MultinormalDistribution[]) and Sjoerd C. de Vries (who used NonlinearModelFit).
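
For context, here is a minimal sketch of those two CPU-side approaches on synthetic data. The data layout ({x, y, value} triples), noise level, and parameter names are my own choices rather than the original answers verbatim.

    (* synthetic 2D Gaussian samples as {x, y, value} triples *)
    data = Flatten[
       Table[{x, y, Exp[-((x - 1.)^2 + (y + 0.5)^2)/2.] + RandomReal[{0., 0.02}]},
        {x, -3., 3., 0.25}, {y, -3., 3., 0.25}], 1];

    (* NonlinearModelFit: direct nonlinear least squares on the samples *)
    nlm = NonlinearModelFit[data,
       a Exp[-((x - x0)^2/(2 sx^2) + (y - y0)^2/(2 sy^2))],
       {{a, 1}, {x0, 0}, {y0, 0}, {sx, 1}, {sy, 1}}, {x, y}];
    nlm["BestFitParameters"]

    (* moment-based approach: treat the values as weights and read the mean and
       covariance of the implied MultinormalDistribution straight off the data *)
    wt = data[[All, 3]]/Total[data[[All, 3]]];
    mean = wt . data[[All, {1, 2}]];
    cov = Total[MapThread[#1 Outer[Times, #2 - mean, #2 - mean] &,
        {wt, data[[All, {1, 2}]]}]];
    MultinormalDistribution[mean, cov]

Each of these runs quickly on the CPU; the questions below are about running very many such fits at once on the GPU.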

Questions:

  • Could these methods be adapted to work on a GPU core in Mathematica 9 (perhaps via CUDALink)?
  • Is there some way of using Mathematica to do GPU-based nonlinear least squares (or some other fitting strategy) to accomplish this? (One possible CUDALink-based sketch is outlined below.)
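
To make the second bullet concrete, below is a rough, untested sketch of how the batching might look with CUDALink's CUDAFunctionLoad. To keep the kernel short, each GPU thread performs a moment-based fit (mean and covariance, as in Rahul Narain's answer) on its own small tile rather than a full Levenberg-Marquardt loop. All names here (momentFit, the tile size, the 5-number parameter layout) are mine, and the trailing launch-size argument reflects my understanding of how the loaded CUDAFunction is invoked, so treat the details as assumptions rather than a working implementation.

    Needs["CUDALink`"]

    kernelSrc = "
    __global__ void momentFit(float *tiles, float *params, int ntiles, int w, int h) {
        int t = threadIdx.x + blockIdx.x * blockDim.x;
        if (t >= ntiles) return;                /* one thread per tile */
        float *img = tiles + t * w * h;
        float s = 0.f, mx = 0.f, my = 0.f;
        for (int y = 0; y < h; y++)             /* zeroth and first moments */
            for (int x = 0; x < w; x++) {
                float v = img[y * w + x];
                s += v; mx += v * x; my += v * y;
            }
        mx /= s; my /= s;
        float sxx = 0.f, syy = 0.f, sxy = 0.f;  /* second central moments */
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                float v = img[y * w + x];
                sxx += v * (x - mx) * (x - mx);
                syy += v * (y - my) * (y - my);
                sxy += v * (x - mx) * (y - my);
            }
        params[5 * t + 0] = mx;                 /* centroid */
        params[5 * t + 1] = my;
        params[5 * t + 2] = sxx / s;            /* covariance entries */
        params[5 * t + 3] = syy / s;
        params[5 * t + 4] = sxy / s;
    }";

    momentFit = CUDAFunctionLoad[kernelSrc, "momentFit",
       {{"Float", _, "Input"}, {"Float", _, "Output"},
        "Integer32", "Integer32", "Integer32"}, 128];

    (* ntiles small images of size tw*th, flattened into one long vector *)
    tw = 11; th = 11; ntiles = 1000;
    tiles = Flatten@Table[
        Table[Exp[-((x - 6.)^2 + (y - 6.)^2)/3.], {y, th}, {x, tw}], {ntiles}];

    (* last argument = number of threads to launch; outputs come back in a list *)
    fits = First@momentFit[tiles, ConstantArray[0., 5 ntiles], ntiles, tw, th, ntiles];
    Partition[fits, 5] // Short

Whether something like this would actually beat a parallelized CPU loop is unclear to me; transfer overhead and how well the per-thread loops map onto the hardware will matter.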
Bob
  • I found something at http://www5.cs.fau.de/research/software/cuda-quasi-newton-optimization/. I don't know whether CUDALink can be used to port this parallel code without much hassle. – PlatoManiac Jun 28 '13 at 08:11
  • @PlatoManiac Interesting! – Bob Jun 28 '13 at 08:29
  • @PlatoManiac Hmm, I'm not so sure this will be easy to implement... though I could certainly be proven wrong. – Bob Jun 28 '13 at 11:03
  • Surely it can be done, although I don't have the experience with GPU programming necessary to make any concrete recommendations. I am curious though: I assume you want this because NonlinearModelFit isn't fast enough. But in question 4700, Ajasja and I fitted a very complicated function using the (comparatively inefficient) Nelder-Mead method in no more than a few milliseconds. So, why is it that your fitting of a simple Gaussian is so slow? – Oleksandr R. Jun 28 '13 at 11:33
  • @PlatoManiac that code looks very useful. But, while L-BFGS is good for large-scale problems, the example given in this question doesn't seem to qualify. A 2-d Gaussian has only five or six parameters, while a "large" problem, for which one would probably like to use such methods, would usually have at least two orders of magnitude more. As such, I would question how worthwhile it really is to try to use a large-scale optimization library on this simple problem, since other methods (e.g. Levenberg-Marquardt on the CPU) should be as fast if not faster after considering setup overheads. – Oleksandr R. Jun 28 '13 at 11:47
  • @OleksandrR. Thanks for your comments. I'm looking to perform the fit in my previous question on a GPU so I can parallelize multiple instances over the hundreds of cores that can be had on a CUDA compatible graphics processor. You're right that each fit is fast (about three milliseconds) but I need to do millions of them quickly. – Bob Jun 28 '13 at 12:08
  • @OleksandrR. The tradeoff in speed vs. number of cores seems to make a lot of sense here if an implementation of a 2D Gaussian fitter can be had on a GPU. – Bob Jun 28 '13 at 12:19
  • It's important to note that a "CUDA core" is not capable of operating independently from all of the others. To what degree this task is parallelizable depends on your problem and data decomposition, but at most you will be able to perform 8-16 independent fits simultaneously (not hundreds) because this is the number of "streaming multiprocessors" available on CUDA-compatible GPUs. In that sense you are not necessarily better off with a GPU implementation than using a CPU with a similar number of cores. – Oleksandr R. Jun 28 '13 at 12:28
  • @OleksandrR. Very good to know, that wasn't clear to me. Still I'd love to see what's possible. – Bob Jun 28 '13 at 12:40
  • @OleksandrR. The Tesla C2070 (http://www.nvidia.com/object/personal-supercomputing.html) claims to have 448 CUDA cores. How might I parse this in terms of CPU equivalents? – Bob Jun 28 '13 at 13:26
  • CPU cores and CUDA streaming multiprocessors (SMs) are fundamentally different, so you can't. But the Tesla C2070 has 14 SMs, FWIW. There are 32 CUDA cores per SM on Fermi-generation chips. (A device-query sketch after these comments shows how to read these figures via CUDALink.) – Oleksandr R. Jun 28 '13 at 13:34
  • @OleksandrR. Though this specific problem may turn out to be too small to reap any real benefit from a GPU port of the code I mentioned, it would still be really interesting to know how one could bring it into Mathematica. That would open up a path for handling large-scale optimization problems in Mathematica using CUDA. But we need a C and CUDA specialist here; I do not qualify! – PlatoManiac Jun 28 '13 at 13:35
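
For anyone wanting to check the hardware figures mentioned above (14 streaming multiprocessors and 32 cores per SM, i.e. 448 CUDA cores, on a Tesla C2070) against whatever card Mathematica actually sees, CUDALink's device query is the quickest route. The specific property names below are from memory and may differ between CUDALink versions:

    Needs["CUDALink`"]
    CUDAInformation[]                           (* full property list for every detected device *)
    CUDAInformation[1, "Core Count"]            (* e.g. 448 on a Tesla C2070 *)
    CUDAInformation[1, "Multiprocessor Count"]  (* e.g. 14 on a Tesla C2070 *)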

0 Answers