14

I have a dataset ds with 3 columns named A, B, C. Columns A and B have repeated values. How can I obtain another dataset that contains a list of the values in C for each of the unique combinations of A and B?

For example,

ds = Dataset[
    {<| "A" -> 2, "B" -> 3, "C" -> 100 |>,
     <| "A" -> 2, "B" -> 4, "C" -> 200 |>,
     <| "A" -> 2, "B" -> 3, "C" -> 300 |>}]

I want to get

Dataset[
 {<| "A" -> 2, "B" -> 3, "Clist" -> {100, 300} |>,
  <| "A" -> 2, "B" -> 4, "Clist" -> {200} |>}]

How can I do that?

Kuba
  • 136,707
  • 13
  • 279
  • 740
amrods
  • 637
  • 3
  • 13

3 Answers3

11

To get a list of C values for each A/B combination:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   |>
]

dataset screenshot

This is not as succinct as SQL's GROUP BY operator, but it does allow us to easily perform multiple aggregations if desired:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   , "CMean" -> Query[Mean, "C"]
   , "CMin" -> Query[Min, "C"]
   , "CMax" -> Query[Max, "C"]
   |>
]

dataset screenshot

WReach
  • 68,832
  • 4
  • 164
  • 269
8

Failed to find a duplicate:

by = {"A", "B"};

Values @ GroupBy[ds, Query[by], MapAt[First, List /@ by] @* Merge[Identity]]

enter image description here

or:

ds // GroupBy[Query[{"A", "B"}] -> (#C &)] // KeyValueMap[<|#, "Clist" -> #2|> &]

ds // GroupBy[Query[{"A", "B"}]] //  KeyValueMap[<|#, "Clist" -> #2[[;; , "C"]]|> &]
Kuba
  • 136,707
  • 13
  • 279
  • 740
  • Thanks for an accept but it is always good to hold on a day or two. Let's not discourage others. – Kuba Jul 26 '16 at 09:07
  • awesome, what if I wanted to "reduce" the resulting dataset by computing say the mean of the list instead? – amrods Jul 26 '16 at 09:07
  • @amrods Then take a look at the first topic linked in comments. It seems to be exactly that.Merge[Mean] is nice in a way that it will automatically reduce A and B lists to the repeating element. Here we have a more general case where I have to use First manually. – Kuba Jul 26 '16 at 09:09
  • 3
    Thanks. I really thought there would be a cleaner way of achieving this, since I think is a fairly common operation on a dataset. – amrods Jul 26 '16 at 09:21
5

You may use GroupBy and Merge.

ds[GroupBy[{#"A", #"B"} &] /* Values, 
 Merge[Identity] /* Query[{"A" -> First, "B" -> First}]]

enter image description here

Hope this helps

Edmund
  • 42,267
  • 3
  • 51
  • 143