3

for example I have a data

Clear[data];
data[n_] := 
  Join[RandomInteger[{1, 10}, {n, 2}], RandomReal[1., {n, 1}], 2];

then data[3] gives

{{4, 8, 0.264842}, {9, 5, 0.539251}, {3, 1, 0.884612}}

in each sublist, first two value is matrix index, the last is matrix element which have to be added together for same matrix index.

I want to transform the data into matrix. Usually I do it like

Clear[toSparse]
toSparse[data_] := 
 SparseArray@
  Normal@Merge[Thread[data[[;; , 1 ;; 2]] -> data[[;; , -1]]], Total]

I cared about the performance

In[171]:= toSparse[data[1000]]; // AbsoluteTiming

Out[171]= {0.00836793, Null}

In[172]:= toSparse[data[10000]]; // AbsoluteTiming

Out[172]= {0.0644464, Null}

In[173]:= toSparse[data[100000]]; // AbsoluteTiming

Out[173]= {1.35507, Null}

In[174]:= toSparse[data[1000000]]; // AbsoluteTiming

Out[174]= {200.862, Null}

Any faster way to do this?

matheorem
  • 17,132
  • 8
  • 45
  • 115
  • @Mr.Wizard The question itself, that I read as "I want to transform the data into matrix. (...). Any faster way to do this?" doesn't really have anything to do with SparseArrays, right? So the duplicate only explains the answers given so far, that use SparseArrays. The reason I point out this is that there are even faster ways than the ones presented here. – Marius Ladegård Meyer Jun 10 '17 at 10:14
  • @Marius I somehow read "I want to transform the data into matrix" as I want to transform the data into a sparse array probably because it was followed by toSparse[data_] := . . .. I'll reopen. I look forward to your answer with a faster method than "TreatRepeatedEntries"! – Mr.Wizard Jun 10 '17 at 10:20
  • @Mr.Wizard Thank you so much for the link : ) – matheorem Jun 13 '17 at 06:51

3 Answers3

5

You can change the SparseArray system options to total repeated entries instead of taking the first. Here is a function that does this:

carl[data_] := Internal`WithLocalSettings[
    old=SystemOptions["SparseArrayOptions" -> "TreatRepeatedEntries"];
    SetSystemOptions["SparseArrayOptions" -> "TreatRepeatedEntries" -> Total],

    SparseArray @ Thread[data[[All,;;2]] -> data[[All,3]]],

    SetSystemOptions[old]
]

Compare this with @edmund's solution:

edmund[data_] := SparseArray @ Normal @ GroupBy[data, Most->Last, Total]

For example:

data[n_] := Join[RandomInteger[{1,10}, {n,2}], RandomReal[1., {n,1}], 2]

d6 = data[10^6];
r1 = carl[d6]; //AbsoluteTiming
r2 = edmund[d6]; //AbsoluteTiming
MinMax[r1-r2]

{0.852608, Null}

{1.26883, Null}

{-1.00044*10^-11, 8.18545*10^-12}

The difference is due to the order in which the repeated entries are totaled in the two methods.

Carl Woll
  • 130,679
  • 6
  • 243
  • 355
4

You may use GroupBy.

Clear[toSparse]
toSparse[data_] := SparseArray@Normal@GroupBy[data, Most -> Last, Total]

Then

toSparse[data[1000]]; // AbsoluteTiming
{0.0019702, Null} 
toSparse[data[10000]]; // AbsoluteTiming
{0.0155542, Null} 
toSparse[data[100000]]; // AbsoluteTiming
{0.181737, Null} 
toSparse[data[1000000]]; // AbsoluteTiming
{2.0271, Null} 

Hope this helps.

Edmund
  • 42,267
  • 3
  • 51
  • 143
  • 2
    Hi, @Edmund. Thank you so much for the solution. I am shocked that GroupBy can be so much faster than Merge, it seems that they perform very similar operation. What cause such a big difference? – matheorem Jun 10 '17 at 02:29
2

I can't see any sparsity at all in this problem, given that we are adding values to n random elements of a 10x10 matrix, and n is up to $10^6$. So, given that we keep data as-is, the algorithm to fill the matrix is so straight-forward that it's a good candidate for compiling. I propose

makeMatrix = Compile[{{inds, _Integer, 2}, {vals, _Real, 1}},
  Block[{mat = Table[0., {10}, {10}]},
   Do[
    mat[[inds[[i, 1]], inds[[i, 2]]]] += vals[[i]]
    , {i, Length[vals]}
    ];
   mat
   ]
  , CompilationTarget -> "C"
  , RuntimeOptions -> "Speed"
  ]

and a wrapper

marius[data_] := makeMatrix[data[[All, 1 ;; 2]], data[[All, 3]]]

Then, on my machine,

d6 = data[10^6];
r1 = carl[d6]; //AbsoluteTiming
r2 = edmund[d6]; //AbsoluteTiming
r3 = marius[d6]; //AbsoluteTiming
Max[Abs[r3 - r1]]

gives

{0.967007, Null}

{1.571291, Null}

{0.305536, Null}

3.63798*10^-11

Marius Ladegård Meyer
  • 6,805
  • 1
  • 17
  • 26