I'm trying to efficiently populate elements of a very large (2^20 x 2^20) symmetric matrix with 1s - luckily the matrix is very sparse, <0.1% filling. Further, the matrix has a very well defined periodic banded structure, as shown here:

[image: sparsity pattern of the matrix, showing the periodic banded structure]

In reality, this matrix is the result of a series of KroneckerProducts of 2x2 matrices (in other words, a Kronecker sum of Pauli X matrices), which is what gives it that characteristic banded structure. I'm hoping to find a way to speed up the generation without using Kronecker products, because even with sparse matrices the computation can take several seconds or minutes depending on the final dimensionality.

My first question relates to creating this sparse matrix efficiently. I've toyed with lots of different ways of generating even simple bands for the sparse array. For simply populating on the diagonal, the quickest method clearly seems to be to use the {i_,i_} notation, as shown here:

dim = 15;

aa = SparseArray[{i_, i_} -> 1, {2^dim, 2^dim}] // RepeatedTiming;
bb = SparseArray[Band[{1, 1}] -> 1, {2^dim, 2^dim}] // RepeatedTiming;
cc = SparseArray[Table[{ii, ii} -> 1, {ii, 2^dim}], {2^dim, 2^dim}] // RepeatedTiming;
dd = SparseArray[Normal[AssociationThread[Table[{ii, ii}, {ii, 2^dim}] -> Table[1, {ii, 2^dim}]]], {2^dim,2^dim}] // RepeatedTiming;

Column[{aa[[1]], bb[[1]], cc[[1]], dd[[1]]}]

aa[[2]] == bb[[2]] == cc[[2]] == dd[[2]]
0.000309
0.031
0.081
0.054

True

However, when we try to do off-diagonal entries, this gets much worse, presumably because the condition has to be continually checked:

dim = 15;

aa = SparseArray[{i_, j_} /; j - i == 1 -> 1., {2^dim, 2^dim}] // RepeatedTiming;
bb = SparseArray[Band[{1, 2}] -> 1, {2^dim, 2^dim}] // RepeatedTiming;
cc = SparseArray[Table[{ii, ii + 1} -> 1, {ii, 2^dim - 1}], {2^dim, 2^dim}] // RepeatedTiming;
dd = SparseArray[Normal[AssociationThread[Table[{ii, ii + 1}, {ii, 2^dim - 1}] -> Table[1, {ii, 2^dim - 1}]]], {2^dim, 2^dim}] // RepeatedTiming;

Column[{aa[[1]], bb[[1]], cc[[1]], dd[[1]]}]

aa[[2]] == bb[[2]] == cc[[2]] == dd[[2]]
0.185
0.031
0.095
0.052

True

From those two examples it seems like Band is our best choice, but Band is still painfully slow, especially compared to {i_,i_} for the diagonal. This is all the more frustrating because in MATLAB the same problem can be solved an order of magnitude faster (this took ~1.4 ms):

[image: MATLAB code and timing output, ~1.4 ms for the same construction]

But the fact that the original {i_,i_} case for the diagonal was so fast suggests that there is a more efficient way to do this.

So then my first question is: given all of that, is there a more efficient way to populate the bands of our sparse matrix, so that the speed can at least rival the equivalent in MATLAB?

And my second question, somewhat predicated on the first: whichever method is the most efficient, what is the best way to generate the periodic banding structure present in the final matrix (see above)? You can accomplish it with Band by manually inserting runs of 0s into the band values (as sketched below), but doing so can't be the most efficient way.
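
To make that concrete, this is roughly what I mean, built from Band rules whose value lists contain explicit 0s to create the gaps (a sketch only; bandWithGaps and bandMatrix are just names I made up here):

(* Each band at offset 2^(k-1) carries a repeating pattern of 2^(k-1) values
   of 0.5 followed by 2^(k-1) explicit zeros; PadRight repeats that pattern
   cyclically out to the full band length. Only the upper triangle is built,
   then the result is symmetrized. *)
bandWithGaps[n_, k_] := Band[{1, 2^(k - 1) + 1}] ->
   PadRight[{}, 2^n - 2^(k - 1),
    Join[ConstantArray[0.5, 2^(k - 1)], ConstantArray[0., 2^(k - 1)]]];

bandMatrix[n_] := With[
   {upper = SparseArray[Table[bandWithGaps[n, k], {k, 1, n}], {2^n, 2^n}]},
   upper + Transpose[upper]];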

Finally, because of the recursive period-2 banded structure of the final matrix, where each quadrant contains diagonal blocks whose side lengths shrink by factors of 2, maybe one could generate all the smaller diagonal blocks and then just place them in the final matrix - I'm not sure how this would best be accomplished, however. Of course, the matrix is symmetric, which should also help with efficient generation: really only one triangle has to be generated and then mirrored.
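
In case it clarifies what I have in mind, the recursion itself is easy to write down: the 2^n x 2^n matrix has two copies of the 2^(n-1) x 2^(n-1) matrix on the diagonal, coupled by 0.5 times the identity in the off-diagonal blocks. A sketch (recur is just an illustrative name; I haven't looked at how it scales):

(* Recursive block structure: H(n) = {{H(n-1), I/2}, {I/2, H(n-1)}},
   assembled with ArrayFlatten from sparse blocks. *)
recur[1] = SparseArray[{{0., 0.5}, {0.5, 0.}}];
recur[n_] := recur[n] = With[
    {prev = recur[n - 1], halfId = 0.5 IdentityMatrix[2^(n - 1), SparseArray]},
    ArrayFlatten[{{prev, halfId}, {halfId, prev}}]];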

Addendum: MATLAB code for generating the matrix plotted above, as requested in the comments. This takes on the order of milliseconds on my machine, even with N = 15.

N = 4;      % number of sites; the quoted timing refers to N = 15
a_id    = (0:2^N-1)';

dimH        = length(a_id);
AA          = sparse(dimH, dimH);

for i = 1:N
    % pair each basis state with the state obtained by flipping bit i-1
    [~,k1,k2] = intersect(a_id, bitxor(a_id,2^(i-1)));
    AA        = AA + sparse(k1,k2,ones(length(k1),1)/2,dimH,dimH);
end

Addendum 2: Henrik's answer is very good and gives what I'm looking for. Still, it's a bit disappointing that the solution is almost an order of magnitude slower than the equivalent in MATLAB, but I'll take it! As a further question, I took a stab at the method briefly mentioned above of manually placing subarrays within the master array. This takes advantage of the very quick generation of diagonal sparse matrices shown above. My current implementation is not very efficient, but I'm wondering whether such a method could be made efficient, and if so, how? This is more a curiosity than anything, as Henrik's answer is already fast enough for my use case. For n = 14 this takes 3 seconds for me.

func[n_] := Module[{
    (* diagonal blocks of sizes 2^0, 2^1, ..., 2^(n-1), each with 0.5 on the diagonal *)
    subarrays = Table[SparseArray[{i_, i_} -> 0.5, {2^dim, 2^dim}], {dim, 0, n - 1}],
    master = SparseArray[{}, {2^n, 2^n}]},
   (* place each diagonal block into its position in the upper triangle *)
   Do[
    master[[(jj - 1) 2^(ii + 1) + 1 ;; 2^ii (2 jj - 1),
       1 - 2^ii + 2^(1 + ii) jj ;; jj 2^(ii + 1)]] = subarrays[[ii + 1]],
    {ii, 0, n - 1}, {jj, 1, 2^(n - 1 - ii)}];
   (* symmetrize *)
   master + Transpose[master]]

Addendum 3: In response to the comment, this is indeed for the purpose of spins on a lattice, and the matrix is simply the Kronecker sum of Pauli X matrices. The equivalent generation using KroneckerProduct (roughly the construction sketched below) takes 400 ms for N = 15, though it's certainly possible my implementation is not optimal.
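
For reference, a minimal sketch of that Kronecker-sum construction (sx, idS and kronSum are just illustrative names, and this is not necessarily the fastest way to write it):

(* Kronecker sum of Pauli X matrices (scaled by 1/2), built from sparse arrays:
   each term places X/2 on one site and the identity on every other site. *)
sx = SparseArray[{{0., 0.5}, {0.5, 0.}}];       (* Pauli X divided by 2 *)
idS[k_] := IdentityMatrix[2^k, SparseArray];    (* sparse identity of dimension 2^k *)
kronSum[n_] := Sum[KroneckerProduct[idS[i - 1], sx, idS[n - i]], {i, 1, n}];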

KHAAAAAAAAN
1 Answer


For a start, I think this should provide the correct result. Do you confirm that?

cf = Compile[{{a, _Integer, 1}, {i, _Integer}},
   Transpose[{a + 1, 1 + BitXor[a, 2^(i - 1)]}],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

ClearAll[A];
A[n_] := With[{a = Range[0, 2^n - 1]},
  SparseArray[Flatten[cf[a, Range[1, n]], 1] -> 0.5, {2^n, 2^n}, 0.]
  ]

It takes 13 ms and 1600 ms on my machine to evaluate A[15] and A[20], respectively. Is that fast enough? If not, I also have this slightly faster version, which happens to produce CRS-conform matrices as well (i.e., SparseArray`SparseArraySortedQ evaluates to True on the output):

cg = Compile[{{i, _Integer}, {n, _Integer}},
   Sort[1 + BitXor[i, 2^Range[0, n - 1]]],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];
B[n_] := SparseArray @@ {Automatic, {2^n, 2^n}, 0., {1, {
Range[0, n 2^n, n],
      Partition[Flatten[cg[Range[0, 2^n - 1], n]], 1]
      },
ConstantArray[0.5, n 2^n]
     }
    };

Evaluating B[15] and B[20] on my machine takes 8 ms and 620 ms, respectively.
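
As a quick sanity check, the two constructions should agree entry for entry, and the second one should already come out in sorted order. For instance (a small case; both expressions should give True):

(* compare the pattern-rule construction with the direct CRS construction *)
A[10] == B[10]
SparseArray`SparseArraySortedQ[B[10]]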

Henrik Schumacher
  • @HenrikSchumacher, just a curiosity: how much do you gain by compiling? If you do not compile (but do parallelize), how much worse is the performance? – Dario Rosa May 30 '20 at 14:48
  • Well, compilation on its own does not really make a difference, because I call BitXor in a listable way. I really use it for its parallelism: cg[Range[0, 2^n - 1], n] is only about 4 times faster than Table[Sort[1 + BitXor[i, 2^Range[0, n - 1]]], {i, 0, 2^n - 1}] on my Quad Core. But the task is also too small to get any benefit from ParallelTable. Since the body of the CompiledFunction consists merely of calls to other libraries, there is also little difference between CompilationTarget -> "C" and CompilationTarget -> "WVM". – Henrik Schumacher May 30 '20 at 14:56
  • Thank you! My question was motivated by the fact that I rarely get great benefits from compilation in my own code, and I was wondering whether I am missing some important points. Could you explain to me why the parallelism obtained by compilation is better than ParallelTable? – Dario Rosa May 30 '20 at 14:59
  • That's a good question. I do not know why this is, but from my experience I can tell that ParallelTable has a tremendous overhead that pays off only if the tasks in its body are big enough and only very little data has to be exchanged. It works great for scheduling tasks that run over a couple of minutes. IIRC, Compile uses a lightweight OpenMP loop to do its thing: it just chops the list of tasks into as many chunks as there are threads available, tells the threads where to read from and where to write to, and then lets them loose. That scales pretty well even for smaller problems. – Henrik Schumacher May 30 '20 at 15:25
  • I see. Good to know! – Dario Rosa May 30 '20 at 16:06
  • This is a great answer, and overall is fast enough for my use case. Still a bit disappointing that this highly optimized Mathematica solution is almost an order of magnitude slower than in MATLAB, but I'll take what I get! As a final request, could you comment on whether the method of manually placing subarrays (detailed in Addendum 2 to the main post) has any possibility of being efficient, just as a curiosity for me? – KHAAAAAAAAN Jun 01 '20 at 06:36
  • I don't think that the method from Addendum 2 can be as fast. You write successively into a sparse matrix. In contrast to dense matrices, this is not a good idea, because each time you do so, the internal structure of the sparse array has to be recomputed. Because your final matrix will have $O(2^n)$ nonzeros, this method should have complexity of at least $O(4^n)$, or even $O(n^2 4^n)$. Really not a good idea. – Henrik Schumacher Jun 01 '20 at 07:27
  • As a followup a few days later - for the second more efficient solution you seem to be making use of a 4th argument to SparseArray. Is there any documentation on this option? I can't seem to find it, and the syntax is odd to me. – KHAAAAAAAAN Jun 03 '20 at 06:00
  • Nah, it is undocumented. It creates a sparse array directly from the row pointers and column indices and maybe also skips some consistency checks. You can find more details in this post (also follow the links posted by user21). – Henrik Schumacher Jun 03 '20 at 07:41