79

I'm very excited about the brand new Dataset function. I have played with it in Wolfram Cloud, and haven't figured out how can I add a new column into an existing Dataset. Here is an example:

data={<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>};
ds=Dataset[data]

Now I can play with ds columns. For example, I can easily make calculations between columns using their names like:

ds[All, (#col1+#col2&)]

{3, 7, 11}

Another way is:

ds[All, <|"col3"-> (#col1+#col2&)|>]

<|col3->3,col3->7,col3->11|>

Now, how can I update ds, to append the brand new calculated column as "col3"? I tried:

Join[ds,ds[All, <|"col3"-> (#col1+#col2&)|>],2]

without success. It would be magic if I could just do something like:

ds[All, "col3"]=ds[All, (#col1+#col2&)]

But it does not work either.

Murta
  • 26,275
  • 6
  • 76
  • 166
  • It would be nice to highlight Simplex's association examples in the title/tags of this question as afaik its the only example I have found on the site of adding a "Column" to an association. – Gordon Coale Jan 02 '15 at 11:16
  • @GordonCoale nice suggestion. Done! – Murta Jan 02 '15 at 12:55

7 Answers7

89

Here are a few ways, each of which operates upon the individual component associations. In the following discussion, recall that when a key name is not a valid symbol we can write, for example, #["col_name"] instead of #col.

We can explicitly construct a new association that includes all of the old columns and adds a new one:

ds[All, <| "col1"->"col1", "col2"->"col2", "col3"->(#col1 + #col2&) |>]
(* col1 col2 col3
   1    2    3
   3    4    7
   5    6    11
*)

This has the disadvantage that we have to list all of the existing columns. To avoid this, we can use Append:

ds[All, Append[#, "col3" -> #col1 + #col2]&]
(* col1 col2 col3
   1    2    3
   3    4    7
   5    6    11
*)

Should we wish to add multiple computed columns, we can use Join:

ds[All, # ~Join~ <| "col3" -> #col1 + #col2, "col4" -> #col1 * #col2 |> &]
(* col1 col2 col3 col4
   1    2    3    2
   3    4    7    12
   5    6    11   30
*)

By exploiting the fact that <| ... |> syntax can be nested:

<| <| "a" -> 1 |>, "b" -> 2 |>
(* <| "a" -> 1, "b" -> 2 |> *)

... we can append columns to the dataset's associations using a shorter form:

ds[All, <| #, "col3" -> #col1 + #col2, "col4" -> #col1*#col2 |> &]
(* col1 col2 col3 col4
   1    2    3    2
   3    4    7    12
   5    6    11   30
*)

2017 Update: It has been observed that the shorter form is not explictly mentioned in the documentation for Association (as of V11.1, see comments 1 and 2 for example). The documentation does mention that lists are "flattened out":

<| {"x" -> 1, "y" -> 2} |>
(* <| "x" -> 1, "y" -> 2 |> *)

... and that all but the last occurrence of repeated keys are ignored:

<| {"x" -> 1, "y" -> 1}, "y" -> 2 |>
(* <| "x" -> 1, "y" -> 2 |> *)

The documentation also frequently says that associations can be used in place of lists in many functions. It should come as no surprise that Association itself allows us to use an association in place of a list:

<| <| "x" -> 1, "y" -> 2 |> |>
(* <| "x" -> 1, "y" -> 2 |> *)

<| <| "x" -> 1, "y" -> 1 |>, "y" -> 2 |>
(* <| "x" -> 1, "y" -> 2 |> *)

This last expression is the "shorter form" from above.

Notwithstanding that the documentation strongly suggests that the short form is valid, I agree with commentators that it would be better if the documentation explicitly discussed the construction.

bobthechemist
  • 19,693
  • 4
  • 52
  • 138
WReach
  • 68,832
  • 4
  • 164
  • 269
  • Cool! Tks. I'll not accept now just to respect the protocol +1 – Murta Jun 25 '14 at 02:46
  • 10
    @WReach, that's a great answer. Eventually, though, you should be able to write dataset[[All, "foo"]] = {...} and have that just work (as long as the list is the right length). – Taliesin Beynon Jul 12 '14 at 07:59
  • @TaliesinBeynon This does not yet work in v10.1. Is it still planned? – Szabolcs May 19 '15 at 12:03
  • @Szabolcs this requires kernel hooks that I haven't had enough time to actually implement. It is still planned, though not for 10.2. – Taliesin Beynon May 21 '15 at 15:12
  • 4
    @TaliesinBeynon I´m studying R data.table, and now I understand why they are so popular. These kind of operations are much simpler to be performed. In data.table syntax, add a new column would be as simpler as: ds[, col3 := col1+col2], and the values would be changed by reference. No need to do ds = ds[, col3 := col1+col2]. Here is a data.table sheet if nice ideas I miss in Dataset – Murta Jul 04 '15 at 19:17
  • Hhi, Sometimes, one cannot use #col1, for example, col1_name, Is there simpler method to do such a thing? like pandas dataframe?col3=dataSetFinal["col1"]/dataSetFinal["col2"] – HyperGroups Apr 10 '16 at 04:52
  • 3
    @HyperGroups When a key name is not a valid symbol we can write, for example, #["col_name"]. – WReach Apr 10 '16 at 18:56
35

Another way that works (for one or more columns) is:

 ds[All, <|#, "col3" -> #col1 + #col2, "col4" -> #col1 - #col2|> &]

This gives:

dataset result


Also, sometimes the values for the new column might not be straightforwardly computed row by row. For example, you might have calculations like this:

newcol = RotateLeft @ Normal[ds[All, (#col1 + #col2 &)]] 
(* {7, 11, 3} *)
newcol2 = RandomSample[newcol]
(* {3, 11, 7} *)

In this case you can use the Association form of the Dataset. Creating a Dataset with an extra column can be done like this:

MapThread[Append, {Normal[ds], Thread["newcol" -> newcol]}] // Dataset
(* col1 col2 newcol
   1    2    7
   3    4    11
   5    6    3
*)

One way to add multiple columns is:

Join[Normal[ds], Association /@ Thread["newcol" -> newcol], 
       Association /@ Thread["newcol2" -> newcol2], 2] // Dataset
(* col1 col2 newcol newcol2
   1    2    7      3
   3    4    11     11
   5    6    3      7
*)    
Simplex
  • 1,146
  • 7
  • 13
11

Update 2018. In fact

Join[ds,ds[All, <|"col3"-> (#col1+#col2&)|>],2]

is working in Mathematica 11.0.0.0

Diego Restrepo
  • 211
  • 2
  • 5
7

I recently ran into this problem as well, especially when the need arises to append a whole new column with data from an entirely different source. Suppose, for example, that we have:

data = {<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>};
ds = Dataset[data]

and that we also have a list with new data:

newdata = {3.14, 2.718, 1.618};

To add the new data as a column named "col3" to the dataset, I came up with a method based on transposing the dataset (it's interesting to note that Datasets that consist of lists of associations can be transposed, but ordinary lists of associations cannot):

addColumnToDataset[dataset_Dataset, column : _List, columnName : _String] /; Length[column] === Length[dataset] :=
    addColumnToDataset[dataset, {column}, {columnName}];

addColumnToDataset[dataset_Dataset, columns : {__List}, columnNames : {__String}] /; And[
    Length[columnNames] === Length[columns],
    Length[dataset] === Dimensions[columns][[2]]
] := Dataset @ Transpose @ Join[
        Transpose[dataset],
        Dataset[AssociationThread[columnNames, columns]]
    ];

The new column can now be added as follows:

addColumnToDataset[ds, newdata, "col3"]

edit Interestingly, I just realized that you can also add columns by using Part-assignment:

data[[All, "col3"]] = newdata;
Dataset[data]

However, it seems this only works on lists-of-associations like data and not on Datasets like ds. You can get around that by switching back and forth between the two with Normal and Dataset. And if you don't like in-place modification, you can always use Block or Module to create a temporary variable on which you do the in-place modification.

Sjoerd Smit
  • 23,370
  • 46
  • 75
  • +1 for the tip about transposing datasets. Thanks. – Gustavo Delfino Dec 14 '17 at 14:50
  • 3
    Sjoerd, there is a function AssociationTranspose in the package GeneralUtilites doing the same: Needs["GeneralUtilities`"]; a = <|"a" -> <|1 -> "A", 2 -> "B"|>, "b" -> <|1 -> "C", 2 -> "D"|>|>; AssociationTranspose[a]; Transpose[Dataset[a]] == Dataset[AssociationTranspose[a]]. Overall, a look into GeneralUtilites may be very worthwhile, in particular for Association-related functions. – Henrik Schumacher Apr 15 '18 at 16:26
4

If the dataset is a simple "list of associations", and if the column exists as a separate variable, this method is nice and clean:

ds = Dataset@{
    <| "name" -> "alice", "age" -> "32" |>, 
    <| "name" -> "bob",   "age" -> "25" |>
}

dataset 1

c = {"los angeles", "new york"}  (* new column vals to add *)

ds = ds // Transpose // Append["city" -> c] // Transpose

Results in:

dataset 2

It also has the benefit of being idempotent (re-running it won't tack on multiple duplicated columns).

ConvexMartian
  • 1,641
  • 11
  • 18
3

Sometimes, one cannot use #col1, for example, col1_name

data = AssociationThread[{"col1", "col2_name"}, #] & /@ {{1, 2}, {3, 4}};

dataSet = Dataset@data

enter image description here

divide[x_, y_] := If[y == 0, 0, Divide[x, y]] // N

rate["new_column"] = divide[#["col1"], #["col2_name"]] &;

dataSetFinal = 
 dataSet[All, 
Association[{"col1" -> "col1", "col2_name" -> "col2_name", 
"col_rate" -> rate["new_column"]}]]

enter image description here

HyperGroups
  • 8,619
  • 1
  • 26
  • 63
2

The answer of bobthechemist seems to work fine as long you have no named rows and only named columns as in (col1, col2 col3). If you have both named column's and named rows it slightly different. I take the same example of bobthechemist but now with named rows too. So instead of

data={<|"col1"->1,"col2"->2|>,<|"col1"->3,"col2"->4|>,<|"col1"->5,"col2"->6|>}

I have

data = <|"1" -> <|"col1" -> 1, "col2" -> 2|>, "2" -> <|"col1" -> 3, "col2" -> 4|>,"3" -> <|"col1" -> 5, "col2" -> 6|>|>

so the rows have row number 1 to 3. Notice too that the bracket's in data are removed (strangely enough you need then if you have no row names but with row names you have to omit them). Now you can append col3 being the addition col1 and col2

ds3 = Dataset[data];
ds4 = ds3[All, Append[#, "col3" -> #col1 + #col2] &]
ps1
  • 71
  • 6