10

Say I have a list of original data (for now I generate a random list as follows):

r[i_] := RandomReal[{0.1, 1}];
R[i_] := RandomReal[1, 5];
data = Table[{r[j], R[j]}, {j, 10^5}];

Now, I want to divide the original data into three categories:

(i) category #1: first entry of data less than 0.2;
(ii) category #2: first entry of data less than 0.5 but greater than 0.2;
(iii) category #3: first entry of data less than 0.9 but greater than 0.5.

So I append them to three different lists as follows:

choose[ii_] := data[[ii]];
new1 = {}; new2 = {}; new3 = {};
Do[output = choose[ii];
   If[output[[1]] < 0.2, AppendTo[new1, output]];
   If[0.2 < output[[1]] < 0.5, AppendTo[new2, output]];
   If[0.5 < output[[1]] < 0.9, AppendTo[new3, output]];, {ii, 1,
     Length@data}]; // AbsoluteTiming

This takes a huge amount of time:

{62.3863, Null}

My actual data set is so large that this method is unusable. I tried ParallelDo, but it does not seem to collect anything.

My question is: How can I parallelize the above code and make it super fast?

Thank you in advance :))

Michael E2
string
  • Use Select? {new1, new2, new3} = {Select[data, #[[1]] < 0.2 &], Select[data, 0.2 < #[[1]] < 0.5 &], Select[data, 0.5 < #[[1]] < 0.9 &]}. You may want to use less than or equal in some of those selector functions to capture all possible values. – MarcoB Dec 13 '21 at 21:54
  • How huge is your data set? What is a typical time limit you want for this 10^5 data set? – Syed Dec 13 '21 at 22:01
  • @MarcoB, thanks. Select seems to be much faster than my original code. I am going to apply it to my original problem to see how it behaves. – string Dec 13 '21 at 22:03
  • @Syed, actual data set size is of order 10^8. My code with AppendTo has been running already for more than 24 hours, still waiting. If it can be done in several minutes, I would be happy with that. – string Dec 13 '21 at 22:06
  • You can do this with a GroupBy / GatherBy to avoid the multiple Select too, like this: demux[y_] := With[{x = y[[1]]}, Which[x < 0.2, 1, 0.2 < x < 0.5, 2, 0.5 < x < 0.9, 3, True, Missing[]]]; assoc = GroupBy[data, demux], and then it's just new1 = assoc[1]; new2 = assoc[2], etc. (a runnable version of this and the Select suggestion follows these comments). – flinty Dec 13 '21 at 22:51
  • Check out section 3.2 here – Chris K Dec 13 '21 at 23:56
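
For reference, here is a runnable version of the two comment suggestions above. This is a sketch assembled from MarcoB's and flinty's comments; the boundaries follow the strict inequalities in the question, and the final Lookup line is just a compact form of flinty's new1 = assoc[1]; new2 = assoc[2]; new3 = assoc[3].

(* MarcoB: one Select per category *)
{new1, new2, new3} = {
    Select[data, #[[1]] < 0.2 &],
    Select[data, 0.2 < #[[1]] < 0.5 &],
    Select[data, 0.5 < #[[1]] < 0.9 &]};

(* flinty: classify each element once, then split with GroupBy *)
demux[y_] := With[{x = y[[1]]},
    Which[x < 0.2, 1, 0.2 < x < 0.5, 2, 0.5 < x < 0.9, 3, True, Missing[]]];
assoc = GroupBy[data, demux];
{new1, new2, new3} = Lookup[assoc, {1, 2, 3}];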

2 Answers

16

OP's method, timed on my machine:

choose[ii_] := data[[ii]];
new1 = {}; new2 = {}; new3 = {};
Do[output = choose[ii];
   If[output[[1]] < 0.2, AppendTo[new1, output]];
   If[0.2 < output[[1]] < 0.5, AppendTo[new2, output]];
   If[0.5 < output[[1]] < 0.9, AppendTo[new3, output]];, {ii, 1, 
    Length@data}]; // AbsoluteTiming
(*  {24.24, Null}  *)

Using vectorized (& autoparallelized) functions:

( cat = Evaluate@Simplify`PWToUnitStep@Piecewise[{
         {1, # < 0.2},
         {2, 0.2 < # < 0.5},
         {3, 0.5 < # < 0.9}}, 0.] &[
           Developer`ToPackedArray@data[[All, 1]]];
  {n1, n2, n3} = Pick[data, cat, #] & /@ {1, 2, 3};
  ) // AbsoluteTiming
(*  {0.0138529, Null}  *)

Check:

{n1, n2, n3} == {new1, new2, new3}
(*  True  *)
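
For readers unfamiliar with the undocumented Simplify`PWToUnitStep: it rewrites the Piecewise above into an arithmetic expression in UnitStep, which can then be evaluated efficiently on the packed array. Below is a rough hand-written equivalent (a sketch, not part of the original answer; firsts, catU, u1, u2, u3 are made-up names, and boundary points such as exactly 0.2 could land in a different category, which does not matter for continuous random data):

firsts = Developer`ToPackedArray@data[[All, 1]];
catU = 1 UnitStep[0.2 - firsts] +
    2 UnitStep[firsts - 0.2] UnitStep[0.5 - firsts] +
    3 UnitStep[firsts - 0.5] UnitStep[0.9 - firsts];  (* 0 for entries >= 0.9 *)
{u1, u2, u3} = Pick[data, catU, #] & /@ {1, 2, 3};
{u1, u2, u3} == {n1, n2, n3}  (* expect True: boundary ties have probability zero here *)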
Michael E2
  • your code works like a charm. My code takes more than a day to append all this data, whereas yours does it in a second! Amazing! (thanks again) – string Dec 14 '21 at 10:29
3

A more idiomatic alternative to AppendTo is Reap/Sow. Your data set is quite large, though, so you will benefit most from Michael E2's answer.

Here is a possible Reap/Sow implementation for your reference.

r[i_] := RandomReal[{0.1, 1}];
R[i_] := RandomReal[1, 5];
data = Table[{r[j], R[j]}, {j, 10^5}];

g = AbsoluteTiming@(
    {c1, c2, c3, oth} = Flatten[#, 1] &@(
       Last@Reap[
         Scan[
           If[First[#] <= 0.2, Sow[#, cat1],
             If[0.2 < First[#] <= 0.5, Sow[#, cat2],
               If[0.5 < First[#] < 0.9, Sow[#, cat3],
                 Sow[#, other]]]] &,
           data],
         {cat1, cat2, cat3, other}]))

{0.354531, {OutputSizeLimit`Skeleton[1]}}

Total@(Length /@ {c1, c2, c3, oth})   (* 100 000 *)
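
A further check, not from the original answer (a sketch; it assumes c1, c2, c3 come from the same data as above): Sow collects the elements in scan order, so the result should agree with equivalent Selects using the same boundaries.

{c1, c2, c3} == {
    Select[data, #[[1]] <= 0.2 &],
    Select[data, 0.2 < #[[1]] <= 0.5 &],
    Select[data, 0.5 < #[[1]] < 0.9 &]}
(* expect True *)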
Syed