
TL;DR: Is there a way to solve linear equations with a sparse matrix (a discretized Laplace operator) efficiently using CUDALink in Mathematica? I didn't find anything like CUDALinearSolve or CUDAMatrixInverse in the documentation. This is my first time working with CUDALink.

The long version: Our prof gave us a project to solve a Poisson equation ($\Delta \Phi = 4\pi\rho$) in 2D. I discretized the Laplacian to an $n^2 \times n^2$ matrix with entries $-4$ for $i = j$ and $1$ for $|i-j| = 1$ or $|i-j| = n$. (*)

I get solutions in reasonable time for up to n=100 using LinearSolve[N[laplace], rho], but our programs are judged on performance, and this approach doesn't perform particularly well...
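
To make this concrete, the matrix I mean is the one below (written with Band here just so the snippet builds it quickly; my own construction is clumsier), and the right-hand side is just placeholder random data:

n = 100;
(* n^2 x n^2 matrix from (*): -4 on the diagonal, 1 on the 1st and n-th off-diagonals *)
laplace = SparseArray[{
     Band[{1, 1}] -> -4,
     Band[{1, 2}] -> 1, Band[{2, 1}] -> 1,
     Band[{1, n + 1}] -> 1, Band[{n + 1, 1}] -> 1},
    {n^2, n^2}];
rho = RandomReal[{-1, 1}, n^2]; (* placeholder for the charge density *)
phi = LinearSolve[N[laplace], rho];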

Unfortunately, we were specifically tasked to use only Java, Mathematica, or Python. I don't know Python, and well, Java is Java, so Mathematica is really my only shot. But since I am basically throwing matrices around all day, I thought a GPU might be better equipped to handle this, and Mathematica has CUDA integration...

I looked through the CUDALink documentation and didn't find anything helpful, which is super surprising to me, since this should be a fairly standard use case for CUDA. Does anyone have an idea of what I could use?

I found cuSOLVER in the CUDA documentation and know that CUDAFunctionLoad is a thing, but I have no clue how to bring these components together.

(*) There is something weird with this discretization, though: my results seem to represent the surface of a cylinder rather than a flat plane. But this is beside the point.

  • If you have something to add to your question, please do so by editing the question instead of adding a comment – the question should give anyone reading it all the necessary information without them having to read through the comments – Lukas Lang May 28 '18 at 22:02
  • thanks for telling me, i'll do that now – Chalky May 28 '18 at 22:09
  • 1
  • 10000 x 10000 is a tiny matrix if you use SparseArray. Try also LinearSolve with the option Method -> "Pardiso". On my machine, "Pardiso" factorizes the 5-stencil Laplacian of a $1120 \times 1120$ grid in about 4.2 seconds, while the default sparse solver needs about 14.8 seconds for the same task. – Henrik Schumacher May 28 '18 at 22:15
  • Making that work on a GPU is highly nontrivial and as far as I can tell, no reasonably working linear solvers for the GPU get shipped with Mathematica. – Henrik Schumacher May 28 '18 at 22:16
  • Thanks for the tip, I will try... n=100 worked with the most naive possible code for this problem. I hope to get it up to n=1000, which seems like a tall order (if only because of RAM: even building the Laplace matrix makes me run out of memory) – Chalky May 28 '18 at 22:24
  • That's because you build it as a dense matrix. Look up SparseArray in the documentation. (My sparse Laplacian matrix is of size $1256641 \times 1256641$ and needs only 190 MB, because sparse arrays store only the nonzero entries and their positions.) – Henrik Schumacher May 28 '18 at 22:32
  • I coded it as a SparseArray, but I wrote Normal[SparseArray[..., which was a relic from debugging; that might have been the issue – Chalky May 28 '18 at 22:58
  • Thanks, that worked. Btw, since you have a similar piece of code at hand: I think my Laplacian is somehow wrong; the results look very much like the space is a cylinder, as a charge at one edge creates a field at the opposing edge. My Laplacian is given by {{i,i}->-4, {i,j}/;Abs[i-j]==n->1, {i,j}/;Abs[i-j]==1->1}, which looks very similar to what I could find online – Chalky May 28 '18 at 23:05
  • Yes, Normal converts back to a dense matrix. Now that you ask me, I realize that I was using a finite-element Laplacian, which is built differently. Anyways, this one should work better: A = SparseArray[{ {i_, i_} :> -4, {i_, j_} /; Abs[i - j] == n :> 1, {i_, j_} /; j == i + 1 && Mod[i, n] != 0 :> 1, {j_, i_} /; j == i + 1 && Mod[i, n] != 0 :> 1 }, {n^2, n^2}]. Plot it for small n = 4 with MatrixPlot and see what the difference is (and think about why there has to be this difference); a sketch of this comparison follows right after the comments. – Henrik Schumacher May 28 '18 at 23:28
  • Figured out the Mod thingy by now, too; kinda hit me like a brick to the face... The way I wrote it, it took ~30s for n=40 just to build the Laplacian; let's see how yours fares – Chalky May 28 '18 at 23:38
  • That's because we used delayed rules to construct the sparse matrix. No matter what the documentation tells you: that's not an efficient way to build a sparse matrix. You might have considerably more fun with A = 1. AdjacencyMatrix[GridGraph[{n, n}]] - 4. IdentityMatrix[n^2, SparseArray];. Also run GridGraph[{n, n}, VertexLabels -> "Name"] for small n to see how the vertices are numbered. – Henrik Schumacher May 28 '18 at 23:40
  • Oh man, thank you so much! You have been incredibly helpful... Gotta love when the documentation lies :D With this Laplacian, n=1000 is no problem (well, the plotting is a problem, but screw that :D) – Chalky May 28 '18 at 23:48
  • Btw: If you want to reply to a specific user and want them to get notified, make sure that something like @user3271145 appears in the comment. (You can ping at most one user per comment this way. Also, this does not work within a question or an answer.) – Henrik Schumacher May 28 '18 at 23:49
  • @HenrikSchumacher Will do! With your code, n=1000 is solved within ~6s btw, this is amazing... I hope I will find some good way to visualize the data, since this powerful method is actually performing better than my plotting stuff :D But thanks again! – Chalky May 28 '18 at 23:56
  • Well, plotting is slow in Mathematica; that's a known issue. The FrontEnd is a bit outdated (at least the macOS version is still only a 32-bit app IIRC), pretty slow, and it might crash if you plot at too high a resolution... Anyways: You're welcome! – Henrik Schumacher May 29 '18 at 00:07
  • Closely related to the ultimate goal of the OP, not the actual question: Poisson solver using Mathematica where I posted an answer based on the relaxation method. – Jens May 29 '18 at 01:09
  • You may also be interested in some alternative ways for constructing discretized Laplacians, in particular based on FiniteDifferenceDerivative: I'm listing a bunch of approaches in my answer to Schrödinger eigenvalue problem in two dimensions (Harmonic Oscillator). – Jens May 29 '18 at 01:17
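
To see the difference Henrik Schumacher points out in the comments above, one can compare the two constructions for a small n with MatrixPlot. This is only a sketch; the variable names are just for illustration.

nSmall = 4;
(* banded matrix from the question: the 1st off-diagonal also couples the last
   point of each grid row to the first point of the next row, which is why the
   solution behaves as if it lived on a cylinder *)
laplaceCylinder = SparseArray[{
     {i_, i_} -> -4,
     {i_, j_} /; Abs[i - j] == 1 -> 1,
     {i_, j_} /; Abs[i - j] == nSmall -> 1},
    {nSmall^2, nSmall^2}];
(* corrected version from the comments: the 1st off-diagonal is interrupted
   wherever a grid row ends, i.e. where Mod[i, n] == 0 *)
laplaceFixed = SparseArray[{
     {i_, i_} -> -4,
     {i_, j_} /; Abs[i - j] == nSmall -> 1,
     {i_, j_} /; j == i + 1 && Mod[i, nSmall] != 0 -> 1,
     {j_, i_} /; j == i + 1 && Mod[i, nSmall] != 0 -> 1},
    {nSmall^2, nSmall^2}];
MatrixPlot /@ {laplaceCylinder, laplaceFixed}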

1 Answer


Turning the comments into an answer.

Making a linear solver work on a GPU is highly nontrivial (it's a task to be assigned to someone who poisoned his mother and father), and as far as I can tell, no reasonably working linear solvers for the GPU are shipped with Mathematica. But one can work reasonably well with the built-in tools for SparseArrays.

The 5-stencil Laplacian is the graph Laplacian of the underlying grid graph. We can obtain it as follows.

n = 1000;
A = 1. AdjacencyMatrix[GridGraph[{n, n}]] - 
     4. IdentityMatrix[n^2, SparseArray]; // AbsoluteTiming // First

0.190387

The following uses the Pardiso solver from the Intel MKL to factorize the matrix and to create a LinearSolveFunction object that we can reuse for multiple solves:

S = LinearSolve[A, "Method" -> "Pardiso"]; // AbsoluteTiming // First

1.90579

Creating a right-hand side for the equation and applying the LinearSolveFunction S:

b = RandomReal[{-1, 1}, n^2];
x = S[b]; // AbsoluteTiming // First
Max[Abs[A.x - b]]

0.285522

1.03473*10^-13

Creating many right-hand sides and solving the equations for all of them at once:

B = RandomReal[{-1, 1}, {n^2, 100}];
X = S[B]; // AbsoluteTiming // First
Max[Abs[A.X - B]]

8.24018

1.03473*10^-13

Now you might wonder why the latter needs considerably less than 100 times as long as the former. The reason is that S[b] and S[B] involve so-called forward and backward substitutions, which cannot be parallelized for a single right-hand side. So S[b] runs at single-core speed, while S[B] can distribute the right-hand sides across my CPU's 4 compute cores.
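
As a rough way to see this effect (a sketch only; timings are machine-dependent), one can solve the same right-hand sides one column at a time, which keeps the substitutions serial per column and should come out noticeably slower than the blocked call S[B]:

(* sketch: solve the 100 right-hand sides one by one and compare with the blocked result X *)
Xcols = Transpose[Table[S[B[[All, j]]], {j, Last[Dimensions[B]]}]]; // AbsoluteTiming // First
Max[Abs[X - Xcols]]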

Henrik Schumacher