Performance issue of Merge function

Question

I am using mma 11.0.1

This post is related to my previous faster way to merge data.

enlightened by Edmund's answer, I found a subtle performance issue of Merge

First, let's define

Clear[data];
data[n_] := Module[{tmp},
  tmp = Join[RandomInteger[{1, 10}, {n, 2}], RandomReal[1., {n, 1}], 
    2];
  Thread[tmp[[;; , 1 ;; 2]] -> tmp[[;; , -1]]]]

then

test = Thread[data[1000][[;; , 1 ;; 2]] -> data[1000][[;; , -1]]];
GroupBy[test, First -> Last, Total] === Merge[test, Total]
(*True*)

now some timing

timing = Transpose@
  Table[test = 
    Thread[data[2^i][[;; , 1 ;; 2]] -> data[2^i][[;; , -1]]];
   {AbsoluteTiming[GroupBy[test, First -> Last, Total];][[1]], 
    AbsoluteTiming[Merge[test, Total];][[1]]}, {i, 1, 19}];

ListLogPlot[timing, PlotRange -> All, Frame -> True, 
 PlotLegends -> {"GroupBy", "Merge"}]

This gives

We can see that the performance of Merge is severely getting worse only when Length of list exceeds a limit. What happened?

score 4 · Answer 1 · answered Jun 10 '17 at 09:27

Your benchmark code got a little twisted up so let's start with a cleaner test.

data[n_] := Thread[RandomInteger[{1, 10}, {n, 2}] -> RandomReal[n]]

f1 = GroupBy[#, First -> Last, Total] &;
f2 = Merge[Total];

f1[#] === f2[#] & @ data[100]

Needs["GeneralUtilities`"]

BenchmarkPlot[{f1, f2}, data, TimeConstraint -> 20]

Okay, so I confirm your result in v10.1 under Windows as well.

Let's modify data to include a parameter controlling the number of unique elements, and try the test again with 1000^2 potential indexes.

ClearAll[data]
data[max_][n_] := Thread[RandomInteger[{1, max}, {n, 2}] -> RandomReal[n]]

BenchmarkPlot[{f1, f2}, data[1000], TimeConstraint -> 20]

Here Merge maintains an advantage across the test.

Let's go the other way and cause extreme duplication with only a few final "bins" for the data:

BenchmarkPlot[{f1, f2}, data[2], TimeConstraint -> 20]

I conclude from this that Merge was designed for maximum performance on sets with many unique keys, at the possible expense of performance with a limited number of keys and extreme duplication.

See Taliesin Beynon's answer to a performance question of my own for another apparent example of algorithmic trade-offs. (Specifically the last paragraph.)

Hi, @Mr.Wizard. Thank you so much for through benchmark. But I doubt your viewpoint Merge was designed for maximum performance on sets with many unique keys... . I'd rather think they screw things up in designing Merge : ) Because no matter the condition of keys, GroupBy is performing well, what is more, SparseArray is even faster. If they design it the way like that, and doesn't mention anything in the documentation. Oh my god, that is really an intentional trap, because for my example in the post, I don't think GroupBy is more natural function to pick up with than Merge: ) — matheorem, Jun 13 '17 at 06:10

Performance issue of Merge function

1 Answers1