
Dear Mathematica users,

In my present research I am faced with a real dense $n\times n$ matrix $A$ where $n \geq 3000$ (hopefully even more). The coefficients of this matrix are fixed, but I will have to repeatedly multiply it by a variable vector: $Ax$.

I am not complaining about Mathematica's speed at this task, which seems quite good, but since I will need to do it very many times, I was wondering whether there is a way to optimise the process. Perhaps declaring the matrix inside a With would help, but beyond that I am out of ideas.
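For reference, here is a minimal timing sketch of the straightforward product (the size and the random test matrix are my own illustration, not the actual $A$). Converting the matrix to a packed array makes sure Dot dispatches to the optimized BLAS routine:

n = 3000;
A = Developer`ToPackedArray@RandomReal[{-1, 1}, {n, n}]; (* stand-in for the fixed matrix *)
x = RandomReal[{-1, 1}, n];
First@RepeatedTiming[A.x] (* baseline time for one matrix-vector product *)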

Another alternative would be a low-rank approximation via the SVD. My thought was: with the SVD I can write \begin{equation} A = \sigma_1 u_1 v_1^\text{T} + \sigma_2 u_2 v_2^\text{T} + \ldots \end{equation} so \begin{equation} Ax = (\sigma_1 v_1^\text{T}x)\, u_1 + (\sigma_2 v_2^\text{T}x)\, u_2 + \ldots \end{equation} As an example, using a rank-100 approximation to a $3000\times 3000$ matrix (which yields a Frobenius-norm error of $\sim 5$–$10\,\%$), I was able to reduce the computation time by a factor of roughly 3 to 4.
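In code, the rank-$r$ idea above amounts to three thin products instead of one full Dot; the parenthesisation matters so that no $n\times n$ matrix is ever re-formed. A sketch continuing the variables from the snippet above, with $r = 100$:

r = 100;
{U, S, V} = SingularValueDecomposition[A, r]; (* keep only the r largest singular triplets *)
First@RepeatedTiming[U.(S.(Transpose[V].x))]  (* three thin O(n r) products *)
Norm[A.x - U.(S.(Transpose[V].x))]/Norm[A.x]  (* relative error of the rank-r product *)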

Thanks in advance for any feedback.

Best regards,

Gabriel Landi

Edit: Forgot to say that $A$ is symmetric and has zero diagonal.

  • Do you have a graphics card that supports CUDA or OpenCL? I've found CUDADot to be very fast even with a pretty weak graphics card. – ssch Jan 15 '13 at 13:44
  • @ssch the CUDA part of CUDADot is very fast, but depending on the problem the overhead for a call can make it slower than a standard Dot. – Yves Klett Jan 15 '13 at 14:02
  • @YvesKlett You are right. I did some testing and I was mistaken; only with matrix-matrix multiplication was there a real advantage. Do you know more than one of the $x$ vectors at a time, so that you could put them into a matrix? – ssch Jan 15 '13 at 14:14
  • Improving the speed of this product will be extremely difficult since it is already highly optimized. The only way I can think of apart from CUDA is a direct call to LinearAlgebra`BLAS`GEMV. – Oleksandr R. Jan 15 '13 at 14:34
  • Does $A$ have full rank? If so, can you work in its eigenspace? Alternatively, can you calculate $AX$ where $X = (\begin{matrix}\vec{x}_1&\vec{x}_2&\cdots\end{matrix})$ instead? That would allow you to use CUDA with full matrix-matrix computations, and the eigenspace idea is still applicable. – rcollyer Jan 15 '13 at 15:30
  • Hi all. Sorry for the delay in answering. Yes, $A$ will usually have full rank. It is also symmetric and has zero diagonal. But I don't know all the $x$'s in advance, so I need one dot product at a time. Thank you all for the support. – Gabriel Landi Jan 15 '13 at 16:11
  • @GabrielLandi well a symmetric matrix definitely makes it easier. Then, can you generate your $\vec{x}_i$ in the eigenspace of $A$? If not, is $A$ amenable to some other form of symmetry reduction that you can exploit in generating your $\vec{x}_i$? Possibly their spherical components? – rcollyer Jan 15 '13 at 16:28
  • @rcollyer What exactly do you mean by generating in eigenspace: if $A = S \Lambda S^{-1}$ then I should do $Ax = S \Lambda S^{-1} x$? Also, I read your link on spherical components, but didn't really get how that would translate to the present problem. Again, thanks for the help. – Gabriel Landi Jan 15 '13 at 16:42
  • Yes, you can calculate $Ax = S\Lambda S^{-1}x$, but if $x$ is already expressed in the basis of $A$, then $Ax$ reduces to {a11, a22, ...} {x1, x2, ...}, which is very fast. But I think you're only likely to get a speedup if $x$ is already in that basis (see the sketch after these comments). If $x$ isn't easily expressible in the basis of $A$, then it may be expressible either in terms of the group representations of $A$ or, barring that, as spherical components. Either way would reduce the computational load, if set up right. But it depends on how $x$ is generated. – rcollyer Jan 15 '13 at 17:03
  • Mostly, I'm just grasping at straws. The best bet would of course be to express $x$ in the basis of $A$, then there is nothing special you have to do, and it will be very fast. The other two ideas require a bit of work, which may not give you the speed up to make them worth it. – rcollyer Jan 15 '13 at 17:13
  • @Jens Actually, I asked that question :) – Gabriel Landi Jan 16 '13 at 10:56
  • Anything else about the $x_{i}$? For example, $x_{i+1}=x_{i} + dx_{i}$ where $dx_{i}$ is small? Easily decompose $x_{i}$ into previous values of $x$? Just spitballing... – MikeY Jan 21 '19 at 15:45
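To make the eigenbasis idea from the comments concrete, here is a rough sketch (my own illustration; the one-time Eigensystem call itself costs $O(n^3)$, and the change of basis costs a full matrix-vector product, so this only pays off if the $x$'s are generated directly in the eigenbasis):

{vals, vecs} = Eigensystem[A]; (* A real symmetric: rows of vecs are orthonormal eigenvectors *)
c = vecs.x;                    (* coordinates of x in the eigenbasis; skip if x is generated there *)
Ax = Transpose[vecs].(vals*c); (* A.x == S.(Lambda.(S^T.x)); vals*c is an elementwise product *)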

1 Answer


SVD is an $O(n^3)$ operation (and a truncated rank-$r$ factorization is probably still of complexity $O(r \, n^2)$), so it is pretty slow. If your matrix $A$ indeed has low rank $r \ll n$, then Adaptive Cross Approximation (ACA) can provide a rank-$r$ factorization in $O(r^2 \, n)$ time with $O(r \, n)$ additional memory. The algorithm is greedy, so you can prescribe a given relative accuracy and it will also find the required rank $r$. The resulting factors usually have a size that is only insignificantly larger than the factors provided by a truncated SVD. The method is also very powerful because it needs only a few columns and rows of the matrix $A$, so $A$ does not need to be fully assembled in memory. Of course, it is not 100% reliable: think of a sparse matrix with only a few nonzero rows and columns; these will quite certainly not be detected. But it works great provided that the matrix $A$ is sufficiently smooth (i.e., if it originates from sampling a smooth integral kernel $K(x,y)$).
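For intuition, here is a rough sketch of the partially pivoted greedy step (my own simplified illustration, not the ACACompression code linked below; a production version also excludes already-used pivot rows and uses a relative stopping criterion):

acaSketch[A_?MatrixQ, tol_ : 1.*^-12, maxRank_ : 100] :=
 Module[{us = {}, vs = {}, i = 1, j, row, col, piv},
  Do[
   row = A[[i]] - If[us === {}, 0., us[[All, i]].vs]; (* i-th row of the residual A - U.V *)
   j = Ordering[Abs[row], -1][[1]];                   (* pivot column: largest residual entry *)
   piv = row[[j]];
   If[Abs[piv] < tol, Break[]];
   col = (A[[All, j]] - If[us === {}, 0., Transpose[us].vs[[All, j]]])/piv; (* scaled residual column *)
   AppendTo[us, col]; AppendTo[vs, row];
   i = Ordering[Abs[col], -1][[1]],                   (* next pivot row *)
   {maxRank}];
  {Transpose[us], vs}]                                (* A ≈ U.V with U: n×r, V: r×n *)

Each step reads one row and one column of $A$ and appends one rank-1 term, which is where the $O(r^2 \, n)$ time and $O(r \, n)$ memory come from.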

Here is an example with a 400-fold speedup. It features a squared distance matrix A which really has (numerically) low rank ($r \leq 6$). Please find the code for ACACompression here.

n = 5000;
A = DistanceMatrix[RandomReal[{-1, 1}, {n, 3}]]^2;
u = RandomReal[{-1, 1}, {n}];

{U, V} = ACACompression[A, "MaxRank" -> 150, "Tolerance" -> 1.*10^-12];
U = Transpose[U];

u = RandomReal[{-1, 1}, {n}];
btrue = A.u; // RepeatedTiming // First
bproxy = U.(V.u); // RepeatedTiming // First

Max[Abs[btrue - bproxy]]

0.0080

0.000019

6.25278*10^-13

The factorization took 0.0042 seconds. For comparison: SingularValueDecomposition[A] took 23.6235 seconds on my machine.

Henrik Schumacher