5

I have a huge panel data with more than 2.5 million observations and more than 15000 groups. I want to calculate group-wise mean and I also need Id variable in my result. This is what I have done.

dat = {{2, .1, .2}, {2, .2, .4}, {2, .3, .6}, {2, .4, .8}, {5, 1, 
    2}, {5, 2, 4}, {7, 20, 10}, {7, 40, 20}, {7, 60, 30}, {7, 80, 
    40}, {7, 100, 50}, {10, 30, 50}};

results =  Table[N[Mean[Select[dat, #[[1]] == i &]]], {i, {2, 5, 7, 10}}]; 

But it is taking too long time. How can we make it quicker. Also, I want to calculate group-wise max value of second column.

Any help is greatly appreciated.

ramesh
  • 2,309
  • 16
  • 29

3 Answers3

7

For the first case, the following should work:

Mean /@ GatherBy[dat, First]

OR

Mean /@ GroupBy[dat, First]

The second case gives an association with index as Keys

For the maximum by second value, try:

MaximalBy[#, #[[2]] &] & /@ GatherBy[dat, First]

OR

MaximalBy[#, #[[2]] &] & /@ GroupBy[dat, First]
RunnyKine
  • 33,088
  • 3
  • 109
  • 176
3
Values @ GroupBy[dat, First, Mean]

{{2, 0.25, 0.5}, {5, 3/2, 3}, {7, 60, 30}, {10, 30, 50}}

Values @ Merge[Mean][# -> {##} & @@@ dat]

{{2, 0.25, 0.5}, {5, 3/2, 3}, {7, 60, 30}, {10, 30, 50}}

kglr
  • 394,356
  • 18
  • 477
  • 896
2

Since you said your dataset had 2.5 million entries and 15000 groups, you might try using my GatherByList function to gain a small improvement in speed when compared to the other answers. Here is the definition of GatherByList:

GatherByList[list_, representatives_] := Module[{func},
    func /: Map[func, _] := representatives;
    GatherBy[list, func]
]

And here is a dataset with 2.5 million entries and ~15000 groups:

data = RandomInteger[15000, {2500000, 3}];

Here's a comparison with the other answers for obtaining the Mean:

r1 = Values @ GroupBy[data, First, Mean]; //AbsoluteTiming (* kglr *)
r2 = Mean /@ GatherBy[data, First]; //AbsoluteTiming (* Runnykine *)
r3 = Mean /@ GatherByList[data, data[[All, 1]]]; //AbsoluteTiming (* Carl *)

r1 === r2 === r3

{0.355567, Null}

{0.311108, Null}

{0.278283, Null}

True

And a comparison with @Runnykine's answer for obtaining the element with the group-wise max of the second column:

r1 = Last @ MaximalBy[#, #[[2]] &] & /@ GatherBy[data, First]; //AbsoluteTiming
r2 = Module[{r = GatherByList[data, data[[All, 1]]]},
    #[[First @ Ordering[#[[All, 2]], -1]]]& /@ r
]; //AbsoluteTiming

r1 === r2

{1.525, Null}

{0.275054, Null}

True

Carl Woll
  • 130,679
  • 6
  • 243
  • 355