17

Consider these two datasets:

d1 = Transpose@Dataset[<|"a" -> Range[5]|>];
d2 = Transpose@Dataset[<|"b" -> Range[5, 1, -1]|>];
{d1, d2}

Mathematica graphics

What's the simplest way to concatenate them "horizontally" and get the following?

Mathematica graphics

Is there a way to do this without explicitly extracting the contents of the datasets, i.e. without resorting to Normal?

I would have expected Join[d1, d2, 2] to work, but it doesn't. Dataset@Join[Normal@d1, Normal@d2, 2] works but it's complicated. Transpose@Join[Transpose[d1], Transpose[d2]] is also complicated. For plain old matrices (lists of lists) I'd just use ArrayFlatten, which doesn't work on Datasets.

I have the same question for the case where the rows are labelled too:

d1 = Dataset[<|"x" -> <|"a" -> 1|>, "y" -> <|"a" -> 2|>, "z" -> <|"a" -> 3|>|>];
d2 = Dataset[<|"x" -> <|"b" -> 4|>, "y" -> <|"b" -> 5|>, "z" -> <|"b" -> 6|>|>];
{d1, d2}

Mathematica graphics

Assume an identical number of rows and identical row labels between d1 and d2.

Mr.Wizard
Szabolcs
  • +1. I regard the Association[.] as the more fundamental data structure and I use Dataset[.] as a mere wrapper to limit large outputs. – Romke Bontekoe May 19 '15 at 14:31
  • By the way I think it could be considered a bug that Join[d1, d2, 2] does not work given that Join otherwise does. Have you filed a report? – Mr.Wizard May 20 '15 at 08:02
  • @Mr.Wizard No, but I will. – Szabolcs May 20 '15 at 08:59
  • @RomkeBontekoe, Association : brick :: Dataset : building. The functionality is sophisticated; for example, I wrote a one-line recursive Trie constructor Query to index (reconstruct) a variable-depth file system tree. – alancalvitti May 26 '15 at 21:48

3 Answers

12

This looks nicer in a Notebook:

Join[d1\[Transpose], d2\[Transpose]]\[Transpose]

Unfortunately transposing a Dataset is very slow. Gordon Coale's alternative is much faster, but the original Dataset@Join[Normal@d1, Normal@d2, 2] is more than an order of magnitude faster than that.
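Since the Normal-based expression is the fastest of the three, it can be wrapped once and reused. The following is a minimal sketch, not a tested library: hJoin and hJoinLabelled are hypothetical names introduced here, and the row-labelled variant uses Merge with Apply[Join], which is an assumption of this sketch rather than something taken from the thread.

```mathematica
(* hJoin and hJoinLabelled are hypothetical helper names, not built-ins. *)

(* Rows stored as a plain list of associations: join the rows pairwise
   at level 2, using the expression from the question. *)
hJoin[left_Dataset, right_Dataset] :=
  Dataset@Join[Normal@left, Normal@right, 2]

(* Rows stored under labels (an association of associations): Merge
   collects the two rows that share a label into a list, and Apply[Join]
   joins that pair into a single row. Assumes identical row labels. *)
hJoinLabelled[left_Dataset, right_Dataset] :=
  Dataset@Merge[{Normal@left, Normal@right}, Apply[Join]]

(* Usage with the unlabelled example from the question *)
d1 = Transpose@Dataset[<|"a" -> Range[5]|>];
d2 = Transpose@Dataset[<|"b" -> Range[5, 1, -1]|>];
hJoin[d1, d2]

(* Usage with the row-labelled example from the question *)
e1 = Dataset[<|"x" -> <|"a" -> 1|>, "y" -> <|"a" -> 2|>, "z" -> <|"a" -> 3|>|>];
e2 = Dataset[<|"x" -> <|"b" -> 4|>, "y" -> <|"b" -> 5|>, "z" -> <|"b" -> 6|>|>];
hJoinLabelled[e1, e2]
```

Both helpers go through Normal internally, so they do not stay inside the Dataset domain; they only hide that step behind a function call.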

Mr.Wizard
  • So I didn't overlook anything then, and this is really the shortest way. I'll just make a function for it. This problem came up when processing a dataset in two different ways, ending up with two datasets of compatible shapes but different contents (columns). It is often inconvenient and sometimes very cumbersome to redo the calculations in a way that produces a single dataset in one go. It's simpler to just continue using what I have and combine them into a single dataset. – Szabolcs May 19 '15 at 18:17
  • @Szabolcs Damn, I missed Transpose@Join[Transpose[d1], Transpose[d2]] in the question and I'm guessing five voters did too. Should I just delete this? – Mr.Wizard May 19 '15 at 18:56
  • I upvoted based on the "looks nicer" part ;) – Gordon Coale May 20 '15 at 08:00
  • @Gordon Okay :-) – Mr.Wizard May 20 '15 at 08:00
  • @Mr.Wizard No, keep it. – Szabolcs May 20 '15 at 08:58
  • I am actually surprised that Join works on Datasets, since the documentation states (in Details) that Join works on Association objects. There is no mention of Datasets. – Romke Bontekoe May 20 '15 at 15:01
  • @Romke Much of the new functionality is still in development and largely undocumented. The best course of action is, in my opinion, to simply try stuff and see what works. – Mr.Wizard May 20 '15 at 15:32
9

This is fugly but fulfils the need of staying in the Dataset domain and is much quicker for large datasets. If we use the analogy of a dataset being a SQL table, I do what I would do in the same situation: create a dummy key on each, join on it, then drop the dummy key. Personally, for small Datasets I prefer @Mr.Wizard's approach from a readability perspective :D

d1 = Transpose@Dataset[<|"a" -> Range[50000]|>];
d2 = Transpose@Dataset[<|"b" -> Range[50000, 1, -1]|>];

{
 (* Dummy-key JoinAcross approach *)
 JoinAcross[
   d1[MapIndexed[Append[#1, "dummy" -> First@#2] &]],
   d2[MapIndexed[Append[#1, "dummy" -> First@#2] &]],
   "dummy"][All, {"a", "b"}] // AbsoluteTiming,
 (* Transpose approach *)
 Join[d1\[Transpose], d2\[Transpose]]\[Transpose] // AbsoluteTiming
}

Concatenate cols
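The dummy-key idea may be easier to follow when split into steps on a small input. This is only a sketch of the same technique: addKey, k1 and k2 are names introduced here for illustration, and the final step swaps the explicit column list for KeyDrop["dummy"] so that it does not depend on the column names.

```mathematica
(* Small inputs so that each intermediate result can be inspected. *)
d1 = Transpose@Dataset[<|"a" -> Range[5]|>];
d2 = Transpose@Dataset[<|"b" -> Range[5, 1, -1]|>];

(* Operator that appends each row's position under the key "dummy".
   addKey is a hypothetical name used only in this sketch. *)
addKey = MapIndexed[Append[#1, "dummy" -> First@#2] &];

k1 = d1[addKey];  (* rows such as <|"a" -> 1, "dummy" -> 1|> *)
k2 = d2[addKey];  (* rows such as <|"b" -> 5, "dummy" -> 1|> *)

(* Match rows on the shared key, then remove the key from every row. *)
JoinAcross[k1, k2, "dummy"][All, KeyDrop["dummy"]]
```

Inspecting k1 and k2 before the join shows the temporary key that the one-liner above adds and removes in a single expression.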

Gordon Coale
  • This demonstrates some interesting functionality, but now that I test it, Szabolcs's original Dataset@Join[Normal@d1, Normal@d2, 2] is more than an order of magnitude faster on your own example. – Mr.Wizard May 20 '15 at 14:01
5

In version 12 Join[d1, d2, 2] seems to work, albeit a bit slower than Szabolcs's Dataset@Join[Normal@d1, Normal@d2, 2].
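One way to check both claims on your own installation is sketched below. It assumes version 12 or later, reuses the 50000-row example from the other answer, and makes no promise about the timings you will see.

```mathematica
(* Assumes version 12 or later, where Join[d1, d2, 2] reportedly works. *)
d1 = Transpose@Dataset[<|"a" -> Range[50000]|>];
d2 = Transpose@Dataset[<|"b" -> Range[50000, 1, -1]|>];

(* Timings in seconds for the direct Join and for the Normal-based form. *)
{
 First@AbsoluteTiming[Join[d1, d2, 2]],
 First@AbsoluteTiming[Dataset@Join[Normal@d1, Normal@d2, 2]]
}

(* Compare the underlying data of the two results. *)
Normal@Join[d1, d2, 2] === Join[Normal@d1, Normal@d2, 2]
```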

wigg0t