Local inversion of small matrices on GPUs?

Question

I don't know much about GPU computing at the moment, so please pardon the simple question. Can one invert local matrices in parallel on the GPU? CUBLAS doesn't seem to support factorization, and most of the LU/QR/Chol libraries I've found for GPUs aim instead to accelerate a single direct factorization.

For example, if mass matrices had to be recomputed for an explicit DG method, is there a way to reinvert them locally on the GPU (i.e. in more of an MPI fashion, computing a factorization in parallel over multiple warps/blocks/etc)?

Edit: I'm trying to see if it's possible to assemble and invert a large number of small matrices on a GPU.

Are you asking about inverting a large number of relatively small matrices? The wording of your question is a little confusing to me. — Godric Seer, Aug 09 '13 at 15:58
@GodricSeer - Yes. Apologies for the confusion, I'll edit the question. — Jesse Chan, Aug 09 '13 at 17:42
@GeoffreyIrving - Usually they'll be neq*(p+3)^d where d is spatial dimension and neq is the number of equations you're solving. Most of our 2D stuff has had matrices around 150-300 in size. — Jesse Chan, Aug 09 '13 at 17:43
Ah, those matrices are large enough that normal methods for "large" matrices are the way to go, since they are large enough that parallelizing each individual inversion is probably important. — Geoffrey Irving, Aug 09 '13 at 21:14
The solve is surprisingly not the dominant issue; we're solving something akin to B'inv(A)B where B is overdetermined over every single element, and the quadrature costs were dominant most of the time.
We were just hoping to parallelize over elements and quadrature for now. — Jesse Chan, Aug 09 '13 at 21:43

score 4 · Answer 1 · answered Aug 09 '13 at 17:59

The short answer to your question is yes: you can invert a large number of small, independent matrices on a GPU, and more than likely you can do it efficiently. The best way to go about it, however, is less straightforward. I can think of three possible implementations, although I will admit right from the start that I have never attempted any of them, so I may be overlooking issues.

1. The simplest would be to assemble all the matrices into one larger block-diagonal matrix on the GPU, then use a single matrix solve on the whole thing. The inverse of a block-diagonal matrix is itself block diagonal, so each block of the result is the inverse of its respective matrix. You would need to tune your block size and thread counts to your smaller matrix sizes, and if the matrices are different sizes, that becomes fairly difficult. This is the only option for which you have a chance of finding a library implementation.
2. Create a higher-level GPU kernel that takes a vector of pointers and matrix sizes, and calls a smaller kernel for each matrix. This would be a more understandable program, but assigning threads/blocks to your smaller matrices would be more complicated, since the block and thread IDs are assigned in the high-level kernel and not in the low-level one.
3. Create a low-level kernel to solve a single matrix, and launch it once per matrix from the CPU. This would likely be the best way to go about it, but you would need to include checks to ensure that all of the kernels have completed before you let the CPU continue further into the program.
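To make the per-matrix work in options 2 and 3 concrete, here is a minimal sketch in plain C++ rather than CUDA, so it runs anywhere. It is not taken from any library: the names `invert_in_place` and `invert_batch`, the contiguous row-major storage, and the pivot tolerance are all assumptions made for illustration. The body of the loop in `invert_batch` is the part that would be handed to one thread or thread block per matrix on a GPU.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: invert one n-by-n row-major matrix in place using
// Gauss-Jordan elimination with partial pivoting. Returns false if a pivot
// is numerically zero, in which case `a` is left partially reduced.
bool invert_in_place(double* a, std::size_t n) {
    std::vector<double> inv(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) inv[i * n + i] = 1.0;

    for (std::size_t col = 0; col < n; ++col) {
        // Partial pivoting: choose the largest entry in this column.
        std::size_t piv = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r * n + col]) > std::fabs(a[piv * n + col])) piv = r;
        if (std::fabs(a[piv * n + col]) < 1e-14) return false;  // assumed tolerance

        if (piv != col) {
            for (std::size_t c = 0; c < n; ++c) {
                std::swap(a[piv * n + c], a[col * n + c]);
                std::swap(inv[piv * n + c], inv[col * n + c]);
            }
        }

        // Scale the pivot row so the pivot becomes 1.
        const double d = a[col * n + col];
        for (std::size_t c = 0; c < n; ++c) {
            a[col * n + c] /= d;
            inv[col * n + c] /= d;
        }

        // Eliminate this column from every other row.
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const double f = a[r * n + col];
            for (std::size_t c = 0; c < n; ++c) {
                a[r * n + c] -= f * a[col * n + c];
                inv[r * n + c] -= f * inv[col * n + c];
            }
        }
    }

    for (std::size_t i = 0; i < n * n; ++i) a[i] = inv[i];
    return true;
}

// Hypothetical driver: invert `count` same-sized matrices stored back to back.
// The iterations are independent of each other, which is what would let a GPU
// assign one matrix per thread or per block.
bool invert_batch(std::vector<double>& batch, std::size_t n, std::size_t count) {
    bool ok = true;
    for (std::size_t m = 0; m < count; ++m)
        ok = invert_in_place(batch.data() + m * n * n, n) && ok;
    return ok;
}
```

For example, calling `invert_batch(batch, 2, 2)` on a vector holding two 2-by-2 matrices overwrites each with its inverse. If the inverses are only ever applied to vectors, factoring each matrix and solving may be preferable to forming explicit inverses; the explicit form is used here only because the question asks about inversion.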

Note that all of these methods could also be used for kernels that construct the matrices within the GPU memory as well, and not just solving them.

That's perfect. All the matrices should be the same size, thankfully. So there probably isn't a library which implements matrix solves in a low-level kernel? — Jesse Chan, Aug 09 '13 at 18:28
Actually, if you only want the lower level solve, then you likely could pull that from an open-source library. I have used ViennaCL as Misery suggested, however I have never launched multiple kernels at once, so I don't know how it would handle it. I do know that CUDA allows for the CPU and GPU to work at the same time, so using that backend it should be possible. — Godric Seer, Aug 10 '13 at 00:59

score 2 · Answer 2 · answered Aug 09 '13 at 08:04


ViennaCL has multiple backends (OpenMP/OpenCL/CUDA) and is easy to use. It can perform QR factorization; however, I don't know if it suits you, as it does only a part of the factorization on the CUDA device. Still, it is one of the friendliest libraries in my opinion:

Manual - page 32

WWW


Misery


It does look friendly to use - thanks. I'm looking more to solve many linear systems in parallel though, not to parallelize a single solve. – Jesse Chan Aug 09 '13 at 21:58
