10

Say I have a list of original data (for now I generate a random list as follows):

r[i_] := RandomReal[{0.1, 1}];
R[i_] := RandomReal[1, 5];
data = Table[{r[j], R[j]}, {j, 10^5}];

Now, I want to divide the original data into three categories:

(i) category #1: first entry of data less than 0.2;
(ii) category #2: first entry of data less than 0.5 but greater than 0.2;
(iii) category #3: first entry of data less than 0.9 but greater than 0.5.

So I append them to three different lists as follows:

choose[ii_] := data[[ii]];
new1 = {}; new2 = {}; new3 = {};
Do[output = choose[ii];
   If[output[[1]] < 0.2, AppendTo[new1, output]];
   If[0.2 < output[[1]] < 0.5, AppendTo[new2, output]];
   If[0.5 < output[[1]] < 0.9, AppendTo[new3, output]];, {ii, 1,
     Length@data}]; // AbsoluteTiming

This takes a huge amount of time:

{62.3863, Null}

My actual data set is so large that this method is unusable. I tried ParallelDo, but it does not seem to collect anything.

My question is: How can I parallelize the above code and make it super fast?

Thank you in advance :))

Michael E2
string
  • Use Select? {new1, new2, new3} = {Select[data, #[[1]] < 0.2 &], Select[data, 0.2 < #[[1]] < 0.5 &], Select[data, 0.5 < #[[1]] < 0.9 &]}. You may want to use less than or equal in some of those selector functions to capture all possible values. – MarcoB Dec 13 '21 at 21:54
  • How huge is your data set? What is a typical time limit you want for this 10^5 data set? – Syed Dec 13 '21 at 22:01
  • @MarcoB, thanks. Select seems to be much faster than my original code. I am going to apply it to my original problem to see how it behaves. – string Dec 13 '21 at 22:03
  • @Syed, actual data set size is of order 10^8. My code with AppendTo has been running already for more than 24 hours, still waiting. If it can be done in several minutes, I would be happy with that. – string Dec 13 '21 at 22:06
  • You can do this with a GroupBy / GatherBy to avoid the multiple Select too, like this: demux[y_] := With[{x = y[[1]]}, Which[x < 0.2, 1, 0.2 < x < 0.5, 2, 0.5 < x < 0.9, 3, True, Missing[]]]; assoc = GroupBy[data, demux], and then it's just new1 = assoc[1]; new2 = assoc[2], etc. (a runnable version of this and the Select suggestion follows these comments). – flinty Dec 13 '21 at 22:51
  • Check out section 3.2 here – Chris K Dec 13 '21 at 23:56
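
For reference, here is a runnable version of the two comment suggestions above. This is a sketch assembled from MarcoB's and flinty's comments; the boundaries follow the strict inequalities in the question, and the final Lookup line is just a compact form of flinty's new1 = assoc[1]; new2 = assoc[2]; new3 = assoc[3].

(* MarcoB: one Select per category *)
{new1, new2, new3} = {
    Select[data, #[[1]] < 0.2 &],
    Select[data, 0.2 < #[[1]] < 0.5 &],
    Select[data, 0.5 < #[[1]] < 0.9 &]};

(* flinty: classify each element once, then split with GroupBy *)
demux[y_] := With[{x = y[[1]]},
    Which[x < 0.2, 1, 0.2 < x < 0.5, 2, 0.5 < x < 0.9, 3, True, Missing[]]];
assoc = GroupBy[data, demux];
{new1, new2, new3} = Lookup[assoc, {1, 2, 3}];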

2 Answers

16

OP's method, timed on my machine:

choose[ii_] := data[[ii]];
new1 = {}; new2 = {}; new3 = {};
Do[output = choose[ii];
   If[output[[1]] < 0.2, AppendTo[new1, output]];
   If[0.2 < output[[1]] < 0.5, AppendTo[new2, output]];
   If[0.5 < output[[1]] < 0.9, AppendTo[new3, output]];, {ii, 1, 
    Length@data}]; // AbsoluteTiming
(*  {24.24, Null}  *)

Using vectorized (& autoparallelized) functions:

( cat = Evaluate@Simplify`PWToUnitStep@Piecewise[{
         {1, # < 0.2},
         {2, 0.2 < # < 0.5},
         {3, 0.5 < # < 0.9}}, 0.] &[
           Developer`ToPackedArray@data[[All, 1]]];
  {n1, n2, n3} = Pick[data, cat, #] & /@ {1, 2, 3};
  ) // AbsoluteTiming
(*  {0.0138529, Null}  *)

Check:

{n1, n2, n3} == {new1, new2, new3}
(*  True  *)
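
For readers unfamiliar with the undocumented Simplify`PWToUnitStep: it rewrites the Piecewise above into an arithmetic expression in UnitStep, which can then be evaluated efficiently on the packed array. Below is a rough hand-written equivalent (a sketch, not part of the original answer; firsts, catU, u1, u2, u3 are made-up names, and boundary points such as exactly 0.2 could land in a different category, which does not matter for continuous random data):

firsts = Developer`ToPackedArray@data[[All, 1]];
catU = 1 UnitStep[0.2 - firsts] +
    2 UnitStep[firsts - 0.2] UnitStep[0.5 - firsts] +
    3 UnitStep[firsts - 0.5] UnitStep[0.9 - firsts];  (* 0 for entries >= 0.9 *)
{u1, u2, u3} = Pick[data, catU, #] & /@ {1, 2, 3};
{u1, u2, u3} == {n1, n2, n3}  (* expect True: boundary ties have probability zero here *)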
Michael E2
  • your code works like a charm. My code takes more than a day to append all this data, whereas yours does it in a second! Amazing! (thanks again) – string Dec 14 '21 at 10:29
3

A more idiomatic alternative to AppendTo is Reap/Sow. Your data set is quite large, though, so you will benefit most from Michael E2's answer.

Here is a possible Reap/Sow implementation for your reference.

r[i_] := RandomReal[{0.1, 1}];
R[i_] := RandomReal[1, 5];
data = Table[{r[j], R[j]}, {j, 10^5}];

g = AbsoluteTiming@(
    {c1, c2, c3, oth} = Flatten[#, 1] &@(
       Last@Reap[
         Scan[
           If[First[#] <= 0.2, Sow[#, cat1],
             If[0.2 < First[#] <= 0.5, Sow[#, cat2],
               If[0.5 < First[#] < 0.9, Sow[#, cat3],
                 Sow[#, other]]]] &,
           data],
         {cat1, cat2, cat3, other}]))

{0.354531, {OutputSizeLimit`Skeleton[1]}}

Total@(Length /@ {c1, c2, c3, oth})   (* 100 000 *)
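
A further check, not from the original answer (a sketch; it assumes c1, c2, c3 come from the same data as above): Sow collects the elements in scan order, so the result should agree with equivalent Selects using the same boundaries.

{c1, c2, c3} == {
    Select[data, #[[1]] <= 0.2 &],
    Select[data, 0.2 < #[[1]] <= 0.5 &],
    Select[data, 0.5 < #[[1]] < 0.9 &]}
(* expect True *)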
Syed