
Transpose on a Dataset is faster when it distributes the Keys down to the 2nd level than when it factors them back out.

 ds[len_] := <|"a" -> Range[len], "b" -> (1/Range[len] // N), "c" -> Range[len - 1, 0, -1]|> // Dataset

(Using N here to work around an unrelated bug; I'll post that separately.)

dst[len_] := ds[len][Transpose]

Then,

{ds[5], dst[5]}

(image: ds[5] and dst[5] rendered as Datasets)

Normal form:

{ds[5] // Normal, dst[5] // Normal} // Column
<|a->{1,2,3,4,5},b->{1.,0.5,0.333333,0.25,0.2},c->{4,3,2,1,0}|>
{<|a->1,b->1.,c->4|>,<|a->2,b->0.5,c->3|>,<|a->3,b->0.333333,c->2|>,<|a->4,b->0.25,c->1|>,<|a->5,b->0.2,c->0|>}
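
As a quick sanity check (a sketch only; it assumes the round trip preserves the key order exactly), transposing dst should recover ds:

    Normal[dst[5] // Transpose] === Normal[ds[5]]

which should return True.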

Timing study:

{10, 1000, 10000, 50000} // 
  AssociationMap[<|
     "ds time" -> (ds[#] // Transpose // Timing // First), 
     "dst time" -> (dst[#] // Transpose // Timing // 
        First)|> &] // Dataset

(image: the timing results rendered as a Dataset)

The timings are similar for smaller datasets, but the ratio grows to roughly 40x at 50k rows. Adding more rows increases the ratio further; I've observed 100x on real-world time series.

Note that because dst is defined with SetDelayed, the Transpose inside its constructor is re-evaluated on every call; this contributes a small amount to the "dst time" column but doesn't affect the conclusion.
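
To take that constructor overhead out of the measurement entirely, here is a minimal sketch (timeRatio and its local variable names are mine, not part of the original study) that builds both Datasets first and only times the two Transpose calls:

    timeRatio[len_] := Module[{d = ds[len], dt = dst[len], tDistribute, tFactor},
      tDistribute = First[Timing[Transpose[d];]];  (* Keys distributed down to level 2 *)
      tFactor = First[Timing[Transpose[dt];]];     (* Keys factored back out to level 1 *)
      <|"distribute" -> tDistribute, "factor out" -> tFactor,
        "ratio" -> tFactor/tDistribute|>]

    timeRatio[50000]

If the constructor overhead really is negligible, the ratio should land near the ~40x reported above.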

What explains this imbalance? If anything, I expected distributing the Keys to take longer.

alancalvitti
  • Related: (83838). Possibly related, at least in concept: (25641) – Mr.Wizard May 27 '15 at 22:30
  • I'm not quite sure why yet, but if you replace your Transpose with Transpose[#, AllowedHeads -> All] &, the timings are much closer... – Stefan R Jun 02 '15 at 19:36
  • Also, the slowness seems to come from the code you use for the timing: I can leave your definitions of ds and dst unchanged, but if I replace the Transposes in the timing study as mentioned above, things are miraculously faster. – Stefan R Jun 02 '15 at 19:39
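
For completeness, a sketch of the variant suggested in the comments above, leaving the definitions of ds and dst untouched and only adding the AllowedHeads -> All option (taken from the comment) to the Transposes inside the timing study:

    {10, 1000, 10000, 50000} //
      AssociationMap[<|
         "ds time" -> First[Timing[Transpose[ds[#], AllowedHeads -> All]]],
         "dst time" -> First[Timing[Transpose[dst[#], AllowedHeads -> All]]]|> &] // Dataset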

0 Answers