Dataset collapsing/reducing

Question

I have a dataset ds with 3 columns named A, B, C. Columns A and B have repeated values. How can I obtain another dataset that contains a list of the values in C for each of the unique combinations of A and B?

For example,

ds = Dataset[
    {<| "A" -> 2, "B" -> 3, "C" -> 100 |>,
     <| "A" -> 2, "B" -> 4, "C" -> 200 |>,
     <| "A" -> 2, "B" -> 3, "C" -> 300 |>}]

I want to get

Dataset[
 {<| "A" -> 2, "B" -> 3, "Clist" -> {100, 300} |>,
  <| "A" -> 2, "B" -> 4, "Clist" -> {200} |>}]

How can I do that?

and a slightly related: Reshaping associations, generalization of GroupBy — Kuba, Jul 26 '16 at 09:12

score 11 · Accepted Answer · answered Jul 26 '16 at 14:42

To get a list of C values for each A/B combination:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   |>
]

This is not as succinct as SQL's GROUP BY operator, but it does allow us to easily perform multiple aggregations if desired:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   , "CMean" -> Query[Mean, "C"]
   , "CMin" -> Query[Min, "C"]
   , "CMax" -> Query[Max, "C"]
   |>
]

I really think you should write a tutorial on Dataset[] at some point... I keep picking up new stuff from your answers on this topic. :) — J. M.'s missing motivation, Jul 26 '16 at 14:44

Kuba · Answer 2 · 2016-07-26T12:12:13.233

8

Failed to find a duplicate:

by = {"A", "B"};

Values @ GroupBy[ds, Query[by], MapAt[First, List /@ by] @* Merge[Identity]]

or:

ds // GroupBy[Query[{"A", "B"}] -> (#C &)] // KeyValueMap[<|#, "Clist" -> #2|> &]

ds // GroupBy[Query[{"A", "B"}]] //  KeyValueMap[<|#, "Clist" -> #2[[;; , "C"]]|> &]

edited Jul 26 '16 at 12:12

answered Jul 26 '16 at 09:03

Kuba

136,707
13
279
740

Thanks for an accept but it is always good to hold on a day or two. Let's not discourage others. – Kuba Jul 26 '16 at 09:07
awesome, what if I wanted to "reduce" the resulting dataset by computing say the mean of the list instead? – amrods Jul 26 '16 at 09:07
@amrods Then take a look at the first topic linked in comments. It seems to be exactly that.Merge[Mean] is nice in a way that it will automatically reduce A and B lists to the repeating element. Here we have a more general case where I have to use First manually. – Kuba Jul 26 '16 at 09:09
3

Thanks. I really thought there would be a cleaner way of achieving this, since I think is a fairly common operation on a dataset. – amrods Jul 26 '16 at 09:21

score 5 · Answer 3 · answered Jul 26 '16 at 10:38

5

You may use GroupBy and Merge.

ds[GroupBy[{#"A", #"B"} &] /* Values, 
 Merge[Identity] /* Query[{"A" -> First, "B" -> First}]]

Hope this helps

answered Jul 26 '16 at 10:38

Edmund

42,267
3
51
143

Dataset collapsing/reducing

3 Answers3

Linked

Related