How can I add a column into a existing Dataset?

Question

I'm very excited about the brand new Dataset function. I have played with it in Wolfram Cloud, and haven't figured out how can I add a new column into an existing Dataset. Here is an example:

data={<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>};
ds=Dataset[data]

Now I can play with ds columns. For example, I can easily make calculations between columns using their names like:

ds[All, (#col1+#col2&)]

{3, 7, 11}

Another way is:

ds[All, <|"col3"-> (#col1+#col2&)|>]

<|col3->3,col3->7,col3->11|>

Now, how can I update ds, to append the brand new calculated column as "col3"? I tried:

Join[ds,ds[All, <|"col3"-> (#col1+#col2&)|>],2]

without success. It would be magic if I could just do something like:

ds[All, "col3"]=ds[All, (#col1+#col2&)]

But it does not work either.

It would be nice to highlight Simplex's association examples in the title/tags of this question as afaik its the only example I have found on the site of adding a "Column" to an association. — Gordon Coale, Jan 02 '15 at 11:16

score 89 · Accepted Answer · edited Mar 31 '18 at 11:34

Here are a few ways, each of which operates upon the individual component associations. In the following discussion, recall that when a key name is not a valid symbol we can write, for example, #["col_name"] instead of #col.

We can explicitly construct a new association that includes all of the old columns and adds a new one:

ds[All, <| "col1"->"col1", "col2"->"col2", "col3"->(#col1 + #col2&) |>]
(* col1 col2 col3
   1    2    3
   3    4    7
   5    6    11
*)

This has the disadvantage that we have to list all of the existing columns. To avoid this, we can use Append:

ds[All, Append[#, "col3" -> #col1 + #col2]&]
(* col1 col2 col3
   1    2    3
   3    4    7
   5    6    11
*)

Should we wish to add multiple computed columns, we can use Join:

ds[All, # ~Join~ <| "col3" -> #col1 + #col2, "col4" -> #col1 * #col2 |> &]
(* col1 col2 col3 col4
   1    2    3    2
   3    4    7    12
   5    6    11   30
*)

By exploiting the fact that <| ... |> syntax can be nested:

<| <| "a" -> 1 |>, "b" -> 2 |>
(* <| "a" -> 1, "b" -> 2 |> *)

... we can append columns to the dataset's associations using a shorter form:

ds[All, <| #, "col3" -> #col1 + #col2, "col4" -> #col1*#col2 |> &]
(* col1 col2 col3 col4
   1    2    3    2
   3    4    7    12
   5    6    11   30
*)

2017 Update: It has been observed that the shorter form is not explictly mentioned in the documentation for Association (as of V11.1, see comments 1 and 2 for example). The documentation does mention that lists are "flattened out":

<| {"x" -> 1, "y" -> 2} |>
(* <| "x" -> 1, "y" -> 2 |> *)

... and that all but the last occurrence of repeated keys are ignored:

<| {"x" -> 1, "y" -> 1}, "y" -> 2 |>
(* <| "x" -> 1, "y" -> 2 |> *)

The documentation also frequently says that associations can be used in place of lists in many functions. It should come as no surprise that Association itself allows us to use an association in place of a list:

<| <| "x" -> 1, "y" -> 2 |> |>
(* <| "x" -> 1, "y" -> 2 |> *)

<| <| "x" -> 1, "y" -> 1 |>, "y" -> 2 |>
(* <| "x" -> 1, "y" -> 2 |> *)

This last expression is the "shorter form" from above.

Notwithstanding that the documentation strongly suggests that the short form is valid, I agree with commentators that it would be better if the documentation explicitly discussed the construction.

Cool! Tks. I'll not accept now just to respect the protocol +1 — Murta, Jun 25 '14 at 02:46
@WReach, that's a great answer. Eventually, though, you should be able to write dataset[[All, "foo"]] = {...} and have that just work (as long as the list is the right length). — Taliesin Beynon, Jul 12 '14 at 07:59
@TaliesinBeynon This does not yet work in v10.1. Is it still planned? — Szabolcs, May 19 '15 at 12:03
@Szabolcs this requires kernel hooks that I haven't had enough time to actually implement. It is still planned, though not for 10.2. — Taliesin Beynon, May 21 '15 at 15:12
@TaliesinBeynon I´m studying R data.table, and now I understand why they are so popular. These kind of operations are much simpler to be performed. In data.table syntax, add a new column would be as simpler as: ds[, col3 := col1+col2], and the values would be changed by reference. No need to do ds = ds[, col3 := col1+col2]. Here is a data.table sheet if nice ideas I miss in Dataset — Murta, Jul 04 '15 at 19:17
Hhi, Sometimes, one cannot use #col1, for example, col1_name, Is there simpler method to do such a thing? like pandas dataframe?col3=dataSetFinal["col1"]/dataSetFinal["col2"] — HyperGroups, Apr 10 '16 at 04:52
@HyperGroups When a key name is not a valid symbol we can write, for example, #["col_name"]. — WReach, Apr 10 '16 at 18:56

Simplex · Answer 2 · 2014-09-16T06:03:25.513

Another way that works (for one or more columns) is:

 ds[All, <|#, "col3" -> #col1 + #col2, "col4" -> #col1 - #col2|> &]

This gives:

dataset result

Also, sometimes the values for the new column might not be straightforwardly computed row by row. For example, you might have calculations like this:

newcol = RotateLeft @ Normal[ds[All, (#col1 + #col2 &)]] 
(* {7, 11, 3} *)
newcol2 = RandomSample[newcol]
(* {3, 11, 7} *)

In this case you can use the Association form of the Dataset. Creating a Dataset with an extra column can be done like this:

MapThread[Append, {Normal[ds], Thread["newcol" -> newcol]}] // Dataset
(* col1 col2 newcol
   1    2    7
   3    4    11
   5    6    3
*)

One way to add multiple columns is:

Join[Normal[ds], Association /@ Thread["newcol" -> newcol], 
       Association /@ Thread["newcol2" -> newcol2], 2] // Dataset
(* col1 col2 newcol newcol2
   1    2    7      3
   3    4    11     11
   5    6    3      7
*)

its worth noting that your very first example doesn't seem to work in 10.0.2 — Gordon Coale, Jan 08 '15 at 23:29

score 11 · Answer 3 · answered Jul 18 '18 at 13:37

11

Update 2018. In fact

Join[ds,ds[All, <|"col3"-> (#col1+#col2&)|>],2]

is working in Mathematica 11.0.0.0

answered Jul 18 '18 at 13:37

Diego Restrepo

211
2
5

Sjoerd Smit · Answer 4 · 2018-04-15T10:13:18.843

I recently ran into this problem as well, especially when the need arises to append a whole new column with data from an entirely different source. Suppose, for example, that we have:

data = {<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>};
ds = Dataset[data]

and that we also have a list with new data:

newdata = {3.14, 2.718, 1.618};

To add the new data as a column named "col3" to the dataset, I came up with a method based on transposing the dataset (it's interesting to note that Datasets that consist of lists of associations can be transposed, but ordinary lists of associations cannot):

addColumnToDataset[dataset_Dataset, column : _List, columnName : _String] /; Length[column] === Length[dataset] :=
    addColumnToDataset[dataset, {column}, {columnName}];

addColumnToDataset[dataset_Dataset, columns : {__List}, columnNames : {__String}] /; And[
    Length[columnNames] === Length[columns],
    Length[dataset] === Dimensions[columns][[2]]
] := Dataset @ Transpose @ Join[
        Transpose[dataset],
        Dataset[AssociationThread[columnNames, columns]]
    ];

The new column can now be added as follows:

addColumnToDataset[ds, newdata, "col3"]

edit Interestingly, I just realized that you can also add columns by using Part-assignment:

data[[All, "col3"]] = newdata;
Dataset[data]

However, it seems this only works on lists-of-associations like data and not on Datasets like ds. You can get around that by switching back and forth between the two with Normal and Dataset. And if you don't like in-place modification, you can always use Block or Module to create a temporary variable on which you do the in-place modification.

Sjoerd, there is a function AssociationTranspose in the package GeneralUtilites doing the same: Needs["GeneralUtilities`"]; a = <|"a" -> <|1 -> "A", 2 -> "B"|>, "b" -> <|1 -> "C", 2 -> "D"|>|>; AssociationTranspose[a]; Transpose[Dataset[a]] == Dataset[AssociationTranspose[a]]. Overall, a look into GeneralUtilites may be very worthwhile, in particular for Association-related functions. — Henrik Schumacher, Apr 15 '18 at 16:26

score 4 · Answer 5 · answered Jul 21 '21 at 13:04

If the dataset is a simple "list of associations", and if the column exists as a separate variable, this method is nice and clean:

ds = Dataset@{
    <| "name" -> "alice", "age" -> "32" |>, 
    <| "name" -> "bob",   "age" -> "25" |>
}

c = {"los angeles", "new york"}  (* new column vals to add *)
ds = ds // Transpose // Append["city" -> c] // Transpose

Results in:

It also has the benefit of being idempotent (re-running it won't tack on multiple duplicated columns).

score 3 · Answer 6 · answered Apr 10 '16 at 04:52

Sometimes, one cannot use #col1, for example, col1_name

data = AssociationThread[{"col1", "col2_name"}, #] & /@ {{1, 2}, {3, 4}};

dataSet = Dataset@data

divide[x_, y_] := If[y == 0, 0, Divide[x, y]] // N

rate["new_column"] = divide[#["col1"], #["col2_name"]] &;

dataSetFinal = 
 dataSet[All, 
Association[{"col1" -> "col1", "col2_name" -> "col2_name", 
"col_rate" -> rate["new_column"]}]]

ps1 · Answer 7 · 2018-04-02T10:17:25.687

The answer of bobthechemist seems to work fine as long you have no named rows and only named columns as in (col1, col2 col3). If you have both named column's and named rows it slightly different. I take the same example of bobthechemist but now with named rows too. So instead of

data={<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>}

I have

data = <|"1" -> <|"col1" -> 1, "col2" -> 2|>, "2" -> <|"col1" -> 3, "col2" -> 4|>,"3" -> <|"col1" -> 5, "col2" -> 6|>|>

so the rows have row number 1 to 3. Notice too that the bracket's in data are removed (strangely enough you need then if you have no row names but with row names you have to omit them). Now you can append col3 being the addition col1 and col2

ds3 = Dataset[data];
ds4 = ds3[All, Append[#, "col3" -> #col1 + #col2] &]

How can I add a column into a existing Dataset?

7 Answers7

Linked