7

I've come across the following "JoinAcrossMerge" sufficiently often to wonder if Dataset's semantics can offer a natural implementation.

Say I have 2 datasets (d1,d2) where I want to "join across" some "fixed" (f1,f2) columns, "merge" across some other "varying" (v1,v2) columns, while ignoring the rest (ig1,ig2); i.e. given

d1 =

enter image description here

d2 =

enter image description here

how to efficiently/elegantly implement something like

JoinAcrossMerge[{d1, d2}, {"f1", "f2"}, {"v1", "v2"}, List]

to yield

enter image description here

?


definitions for d1, d2

d1 = Dataset[{
<|"f1" -> 1, "f2" -> 2, "v1" -> A, "v2" -> B, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 3, "f2" -> 4, "v1" -> C, "v2" -> D, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 5, "f2" -> 6, "v1" -> "E", "v2" -> F, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 11, "f2" -> 12, "v1" -> 12, "v2" -> 13, "ig1" -> 20, "ig2" -> 30|>}]


d2 = Dataset[{
<|"f1" -> 1, "f2" -> 2, "v1" -> U, "v2" -> V, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 3, "f2" -> 4, "v1" -> W, "v2" -> X, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 5, "f2" -> 6, "v1" -> Y, "v2" -> Z, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 13, "f2" -> 14, "v1" -> 12, "v2" -> 13, "ig1" -> 20, "ig2" -> 30|>,
<|"f1" -> 15, "f2" -> 16, "v1" -> 17, "v2" -> 18, "ig1" -> 20, "ig2" -> 30|>}]

Conclusion:

It can be done currently with varying degrees of inarticulatness/inefficiency but syntactically it appears to most naturally fall under JoinAcross although a related option, KeyCollisionFunction, doesn't possess merging capability. This is temporarily rectified here by defining MyJoinAcross with a potentially more encompasing KeyCoalescingFunction option.

 fixed = {"f1", "f2"};
 varying = {"v1", "v2"};
 {d1N = d1[All, Join[fixed, varying]], 
 d2N = d2[All, Join[fixed, varying]]}
 (* ignoring the original "ig1","ig1" since these are really red herrings given possible pre-processing and/or lop-sidingly coalescing via `JoinAcross`' existing (`"Inner"`,`"Outer"` etc) semantics *)

MyJoinAcross[d1N, d2N, fixed, "KeyCoalescingFunction" -> foo ]

enter image description here

MyJoinAcross here applies JoinAcrossMerge's implementation as originally requested and as implemented by the OP in a self-answer. It is preferred here since it retains some of JoinAcross semantics although the other JoinAcrossMerge definitions contributed in answers are potentially faster and/or possess slightly different semantics. If these are preferred in a particular circumstance, KyJoinAcross (adapted from Kuba's answer) or CyJoinAcross (adapted from alancalvittii's answer) can be used instead.

Ronald Monson
  • 6,076
  • 26
  • 46

3 Answers3

5

That's ugly but works:

JoinAcrossMerge[
  datasets : {__Dataset}, across : {__String}, butAlso : {__String}, by_: Identity
] := GroupBy[
     Join @@ datasets,
     Lookup[across] -> KeyTake[Join[across, butAlso]],
     If[Length[#] == 1, <||>, <|#[[1]][[across]], Merge[#[[;; , butAlso]], by]|>] &
][Values@*DeleteCases[<||>]] // Dataset



JoinAcrossMerge[ {d1, d2}, {"f1", "f2"}, {"v1", "v2"}, foo ]

enter image description here

Kuba
  • 136,707
  • 13
  • 279
  • 740
  • 1
    small "mistake" : Join[d1,d2]->Join[Sequence @@ datasets]. And you can just get rid of DeleteCases if you replace the <||> (in the If) with Nothing ;) – SquareOne Feb 24 '16 at 21:30
  • @SquareOne I was playing with Nothing but somethimes strange things are going on inside and it evaporates ealier giving additional errors ;/ But first point is on topic :) – Kuba Feb 24 '16 at 21:39
2

Partial solution using nested Merge followed by a Join across key-value pairs:

Dataset[{d1, d2}][All, Normal][All, GroupBy[Query[{"f1", "f2"}]], 
  All, {"v1", "v2"}][
 Merge[Merge[Identity]] /* Select[Length[#v1] == 2 &] /*  
  KeyValueMap[Join]]

enter image description here

This is less than Kuba's solution. I don't know how to avoid the Select, which would break if foo is substituted for Identity. However fits OPs requirements.

How it works

This functional pattern leverages WL's exceptional key-value semantics where Keys can be general expressions as seen by this query fragment (red frame outlines desired rows):

Dataset[{d1, d2}][All, Normal][All, GroupBy[Query[{"f1", "f2"}]], 
  All, {"v1", "v2"}][Merge[Merge[Identity]]]  

enter image description here

alancalvitti
  • 15,143
  • 3
  • 27
  • 92
1

The issue originally arises from JoinAcross' KeyCollisionFunction not accommodating merging:

(note "ignored" columns are omitted here as these are red herrings I should have originally omitted in the question [but which don't change the relevance of answers] - also prepending "L-" "R-" satisfies Dataset's current depth needs in order to format)

 fixed = {"f1", "f2"};
 varying = {"v1", "v2"};

 d1N = d1[All, Join[fixed, varying]];
 d2N = d2[All, Join[fixed, varying]];

 joined = JoinAcross[d1N, d2N, fixed,"KeyCollisionFunction" -> ({"L-" <> #, "R-" <> #} &)]

enter image description here

The difficulty is that KeyCollision's function separates the values of both colliding keys into left ( L-_ ) and right ( R-_ ) columns thereby preventing you from getting at them simultaneously in the initial sweep. Perhaps this is deliberate ("non-joined keys in JoinAcross" are considered to "collide" as opposed to being "mergable") or perhaps merging is slated for later releases; at any rate we can get at both values in a subsequent, "merging sweep" (albeit less efficiently than desirable).

Now we could sweep across each row coalescing each ( L-_ ) and ( R-_ ) pair while sweeping down or possibly more efficiently by doing this coalescing once by first transposing. A helper Collapse function collapses a specified pair, Fold does the job for all pairs before transposing back - code for which appears at the post's end:

enter image description here

We have called this implementation MyJoinAcrossMerge so that as originally requested we have

MyJoinAcrossMerge[{d1N, d2N}, fixed, varying, foo ]

enter image description here

This usage really does however, seem to fall more naturally into JoinAcross' syntax so for future reference we'll re-define a similar version that automatically finds the "varying" keys (the "non-fixed" keys common to both) while relabelling with a KeyCoalescingFunction that seems to better connote this broader capture:

MyJoinAcross[d1_, d2_, fixed_, "KeyCoalescingFunction" -> f_] :=Where[
varying = Normal@Complement[Join @@ (Keys@First@# & /@ {d1, d2}), fixed],
MyJoinAcrossMerge[{d1, d2}, fixed, varying, f ]];

We can do a similar framing for both Kuba's and alancalvitti's answers to yield alternative versions KyJoinAcross and CyJoinAcross (see the post's end for details) before checking timings:

 {{Spacer@50, d1N, Spacer@30, d2N} // Row,
  "",
  MyJoinAcross[d1N, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming, 
  KyJoinAcross[d1N, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming,
  CyJoinAcross[d1N, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming,
  CyJoinAcross[d1N, d2N, fixed, "KeyCoalescingFunction" -> List ] //RepeatedTiming
  } // Column

enter image description here

while noting slightly different semantics exemplified here when the first dataset contains a repeated row:

{{Spacer@45, d1Na = Insert[d1N, d1N[[3]], 3], Spacer@20, d2N} // Row,
  "",
  MyJoinAcross[d1Na, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming, 
  KyJoinAcross[d1Na, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming,
  CyJoinAcross[d1Na, d2N, fixed, "KeyCoalescingFunction" -> foo ] //RepeatedTiming,
  CyJoinAcross[d1Na, d2N, fixed, "KeyCoalescingFunction" -> List ] //RepeatedTiming
  } // Column

enter image description here

The upshot is that it'd be nice to have this functionality built-in but at least these versions seem serviceable including the option to maintain some of JoinAcross's current semantics.


Definitions

    << GeneralUtilities` (* to gain a "where-with-all" *)

MyJoinAcrossMerge[datasets : {__Dataset}, across : {__String}, butAlsoMerge :{__String}, by_: Identity] := 
  Module[{wrapLeftRight, peelWrapper, Collapse},

   wrapLeftRight := Function[key, {Left@key, Right@key}];
   peelWrapper := Function[wrapped, First@First@wrapped];

   Collapse[{patt_, kf_}, mf_] := Function[assoc, Where[
      rows = KeySelect[MatchQ@patt]@assoc,
      keys = Keys@rows,
      vals = mf /@ Thread@(Values@rows),
      coll = (kf@(keys)) -> vals,
      Insert[coll, Key[First@keys]]@assoc // KeyDrop[keys]
      ]];

   (* Collapse[{*part*, *kf*} *mf*] collapses (row) keys matched by *patt* into a single key with a value equal to *mf* applied to the list of the values of the matched keys (The actual single key is defined by applying *kf* to the matched keys) *)

   Where[
    joinedT = JoinAcross[Sequence @@ datasets, across, "KeyCollisionFunction" -> wrapLeftRight]//Transpose//Normal,
    Fold[(Collapse[{Left[#2] | Right[#2], peelWrapper}, by]@#1) &, 
         joinedT,
         butAlsoMerge] // Dataset // Transpose
    ]];

   (* Now converting to JoinAcross syntax *)

    MyJoinAcross[d1_, d2_, fixed_, "KeyCoalescingFunction" -> f_] := Where[varying = Normal@Complement[Join @@ (Keys@First@# & /@ {d1, d2}), fixed],
    MyJoinAcrossMerge[{d1, d2}, fixed, varying, f ]];

(* Kuba's answer and converted to JoinAcross syntax *)

KyJoinAcrossMerge[datasets : {__Dataset}, across : {__String}, butAlso : {__String}, by_: Identity] := 
 GroupBy[Join @@ datasets, 
    Lookup[across] -> KeyTake[Join[across, butAlso]], 
    If[Length[#] == 1, <||>, <|#[[1]][[across]], 
       Merge[#[[;; , butAlso]], by]|>] &][Values@*DeleteCases[<||>]] //
   Dataset

KyJoinAcross[d1_, d2_, fixed_, "KeyCoalescingFunction" -> f_] :=

  Where[varying = 
    Normal@Complement[Join @@ (Keys@First@# & /@ {d1, d2}), fixed],
   KyJoinAcrossMerge[{d1, d2}, fixed, varying, f ]];

(* alancalvitti's answer converted to JoinAcross syntax (List merging only) *)

    CyJoinAcrossMerge[datasets : {__Dataset}, across : {__String}, 
  butAlso : {__String}, List] := 
 Dataset[datasets][All, Normal][All, GroupBy[Query[across]], All, 
   butAlso][
  Merge[Merge[Identity]] /* 
   Select[Function[row, Length[row[First@butAlso]] == 2]] /* 
   KeyValueMap[Join]]

CyJoinAcross[d1_, d2_, fixed_, "KeyCoalescingFunction" -> List] :=

  Where[varying = 
    Normal@Complement[Join @@ (Keys@First@# & /@ {d1, d2}), fixed],
   CyJoinAcrossMerge[{d1, d2}, fixed, varying, List ]];

(* alancalvitti's converted but with arbitrary merging function *)

    CyJoinAcrossMerge[datasets : {__Dataset}, across : {__String}, 
  butAlso : {__String}, by_: Identity] := 
 Dataset[datasets][All, Normal][All, GroupBy[Query[across]], All, 
    butAlso][
   Merge[Merge[If[Length@# == 2, by@#, dummy] &]] /* 
    KeyValueMap[Join]][
  Select[Function[row, Head[row[First@butAlso]] === by]]]

CyJoinAcross[d1_, d2_, fixed_, "KeyCoalescingFunction" -> f_] :=

  Where[varying = 
    Normal@Complement[Join @@ (Keys@First@# & /@ {d1, d2}), fixed],
   CyJoinAcrossMerge[{d1, d2}, fixed, varying, f ]];
Ronald Monson
  • 6,076
  • 26
  • 46