Dataset Processing: efficient ways to clean and merge sets for Life Sciences

Question

Dataset Processing (for Life Sciences)

Note: a related but distinct task is posted here: ID Swapping: Efficient use of a reference table to convert ID values.

A common task, at least for me, involves analyzing at least two different Datasets.

A common example, at least currently in the biological sciences, involves gene lists. Depending on where they came from, who processed them, which pipeline they were processed with, etc., they might vary in form.

For example, consider the following two Datasets:

In Dataset One (d1) we see that there are two conditions c1 and c2 with various replicates, e.g. {c1_1, c1_2, c1_3}. In Dataset Two (d2) we do not have the issue of replicates. However, in both sets we have duplicate ids, with varying values in their columns. Biologically this might arise if one converted the transcript id (a subset of the gene) to the gene id. Lastly, not all genes are in both sets. Therefore we have some preprocessing to do:

find those IDs common to both sets
combine them into a single Dataset
find duplicate rows (e.g. same value in the ID column)
average those duplicate rows by column value
replace those duplicates with their average
find replicates in the column headers
average those columns together

i.e. applying the above transformations to given datasets, we should end up with:

There are a lot of ways to approach this problem. Below I am including what I made to answer this post. However, I am sure it is not the most efficient (or most elegantly coded) method. Thus I would appreciate your assistance in finding better ways to do this kind of pre-processing.

Shout-out to @OneSquare's answer on Variable named slots.

Here are the "datasets" used:

d1=Dataset@{<|"Gene" -> "a", "c1_1" -> 0.0862185, "c1_2" -> 0.591649, 
  "c1_3" -> 0.119653, "c2_1" -> 0.329605, 
  "c2_2" -> 0.953679|>, <|"Gene" -> "b", "c1_1" -> 0.0837976, 
  "c1_2" -> 0.408317, "c1_3" -> 0.427002, "c2_1" -> 0.373136, 
  "c2_2" -> 0.0670787|>, <|"Gene" -> "c", "c1_1" -> 0.331962, 
  "c1_2" -> 0.389325, "c1_3" -> 0.673205, "c2_1" -> 0.346972, 
  "c2_2" -> 0.784099|>, <|"Gene" -> "d", "c1_1" -> 0.460994, 
  "c1_2" -> 0.376045, "c1_3" -> 0.0499006, "c2_1" -> 0.165925, 
  "c2_2" -> 0.547476|>, <|"Gene" -> "e", "c1_1" -> 0.0474756, 
  "c1_2" -> 0.721516, "c1_3" -> 0.0866807, "c2_1" -> 0.754684, 
  "c2_2" -> 0.00415091|>, <|"Gene" -> "f", "c1_1" -> 0.258425, 
  "c1_2" -> 0.910458, "c1_3" -> 0.0203598, "c2_1" -> 0.267614, 
  "c2_2" -> 0.675246|>, <|"Gene" -> "c", "c1_1" -> 0.331962, 
  "c1_2" -> 0.389325, "c1_3" -> 0.673205, "c2_1" -> 0.346972, 
  "c2_2" -> 0.784099|>, <|"Gene" -> "d", "c1_1" -> 0.460994, 
  "c1_2" -> 0.376045, "c1_3" -> 0.0499006, "c2_1" -> 0.165925, 
  "c2_2" -> 0.547476|>, <|"Gene" -> "c", "c1_1" -> 0.331962, 
  "c1_2" -> 0.389325, "c1_3" -> 0.673205, "c2_1" -> 0.346972, 
  "c2_2" -> 0.784099|>, <|"Gene" -> "c", "c1_1" -> 0.331962, 
  "c1_2" -> 0.389325, "c1_3" -> 0.673205, "c2_1" -> 0.346972, 
  "c2_2" -> 0.784099|>, <|"Gene" -> "a", "c1_1" -> 0.0862185, 
  "c1_2" -> 0.591649, "c1_3" -> 0.119653, "c2_1" -> 0.329605, 
  "c2_2" -> 0.953679|>}
d2=Dataset@{<|"Gene" -> "h", "f1" -> 0.93386, "f2" -> 0.684875, 
  "f3" -> 0.599702|>, <|"Gene" -> "b", "f1" -> 0.93083, 
  "f2" -> 0.735748, "f3" -> 0.586162|>, <|"Gene" -> "j", 
  "f1" -> 0.373753, "f2" -> 0.246, 
  "f3" -> 0.150022|>, <|"Gene" -> "d", "f1" -> 0.945271, 
  "f2" -> 0.553761, "f3" -> 0.658329|>, <|"Gene" -> "k", 
  "f1" -> 0.35108, "f2" -> 0.575718, 
  "f3" -> 0.337428|>, <|"Gene" -> "f", "f1" -> 0.525761, 
  "f2" -> 0.198373, "f3" -> 0.168825|>, <|"Gene" -> "d", 
  "f1" -> 0.525761, "f2" -> 0.198373, 
  "f3" -> 0.168825|>, <|"Gene" -> "d", "f1" -> 0.525761, 
  "f2" -> 0.198373, "f3" -> 0.168825|>, <|"Gene" -> "a", 
  "f1" -> 0.525761, "f2" -> 0.198373, 
  "f3" -> 0.168825|>, <|"Gene" -> "b", "f1" -> 0.525761, 
  "f2" -> 0.198373, "f3" -> 0.168825|>}

@Kuba's approach (prior to this update) is certainly more succinct, though the syntax is a bit foreign to me. It does merge the data sets together and take the mean; however, it does not handle duplicate IDs. The replicate part of this question was added during the update, so naturally it was not included in his answer.

Desired exact result

The desired results on the given example data in order of transformations is as follows.

common ids of both sets: {"a", "b", "d", "f"}
combining (tack d2 onto the end of d1)
find duplicate rows, e.g. in d1 there are two rows with the id "a", four with "c", etc.
average those duplicates together by id, e.g. for rows with the id "a", looking only at column "c1_1", the average would be $(0.0862185+0.0862185)/2$.
find replicates in the column header (e.g. "c1_1", "c1_2", "c1_3" are replicates of "c1")
average them together, so for "a" and the replicates of "c1", $(0.0862185+0.591649+0.119653)/3$.

i.e. these steps produce the result of the example given above.
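As a sanity check on the target values, the two-stage average for gene "a" can be worked directly from the d1 rows listed earlier. This is only a sketch on values copied from d1; the names aRow, aRows, aMean, c1 and c2 are introduced here for illustration, and the "Gene" column is left out so that only numeric values are averaged.

```wolfram
(* Both d1 rows for gene "a" carry the same values, so the duplicate average equals either row. *)
aRow = <|"c1_1" -> 0.0862185, "c1_2" -> 0.591649, "c1_3" -> 0.119653,
   "c2_1" -> 0.329605, "c2_2" -> 0.953679|>;
aRows = {aRow, aRow};

(* Stage 1: average the duplicate rows column by column. *)
aMean = Merge[aRows, Mean];

(* Stage 2: average the replicate columns of each condition. *)
c1 = Mean[Lookup[aMean, {"c1_1", "c1_2", "c1_3"}]]  (* 0.26584 *)
c2 = Mean[Lookup[aMean, {"c2_1", "c2_2"}]]          (* 0.641642 *)
```

Gene "a" occurs only once in d2, so its f1, f2 and f3 values pass through unchanged.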

WReach · Answer 1 · 2016-10-28T03:43:59.067

7

We can get a long way by (inner) joining across the common gene keys using JoinAcross, grouping by gene, and then averaging across each group:

JoinAcross[d1, d2, "Gene"][GroupBy["Gene"] /* Values, Merge[Mean]]

What remains is to merge and average the similarly named columns. This is really the crux of this question. Here is a helper function that does the job:

averageKeys =
  KeyValueMap[<| StringReplace[#, n__~~"_"~~___ :> n] -> #2 |> &] /* Merge[Mean];

Here it is in action:

<| "Gene" -> "x"
 , "c1_1" -> 10, "c1_2" -> 40, "c1_3" -> 40
 , "c2_1" -> 10, "c2_2" -> 20
 , "f1" -> 1, "f2" -> 2, "f3" -> 3
 |> // averageKeys

(* <| "Gene" -> "x", "c1" -> 30, "c2" -> 15, "f1" -> 1, "f2" -> 2, "f3" -> 3|> *)
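The helper does two things in sequence, and it may help to look at them separately. The sketch below is illustrative only: renamed and merged are names introduced here, and the inputs are toy values rather than the question's data.

```wolfram
(* Step 1: strip everything from the underscore onwards; keys without "_" are left unchanged. *)
renamed = StringReplace[{"Gene", "c1_1", "c1_2", "c2_1", "f1"}, n__ ~~ "_" ~~ ___ :> n]
(* {"Gene", "c1", "c1", "c2", "f1"} *)

(* Step 2: Merge gathers the values that now share a key and applies Mean to each list. *)
merged = Merge[
  {<|"c1" -> 10|>, <|"c1" -> 40|>, <|"c1" -> 40|>, <|"c2" -> 10|>, <|"c2" -> 20|>},
  Mean]
(* <|"c1" -> 30, "c2" -> 15|> *)
```

KeyValueMap is what produces the list of single-key associations that Merge consumes in step 2.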

We can then supplement the original query to use this helper function and obtain the final result:

JoinAcross[d1, d2, "Gene"][GroupBy["Gene"] /* Values, Merge[Mean], averageKeys]

If row order matters, we can sort them after the fact:

%[Sort]

... or adjust the full query to include a sorting stage:

JoinAcross[d1,d2,"Gene"][GroupBy["Gene"] /* Values /* Query[Sort], Merge[Mean], averageKeys]

Sort is wrapped in Query to ensure that it happens after the inner operations are complete (i.e. to convert Sort from a "descending" operator into an "ascending" one).
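For readers unfamiliar with /*, it is RightComposition: the operators are applied left to right. A minimal sketch on ordinary expressions, outside any Dataset, follows; it does not demonstrate the ascending/descending distinction, which is specific to Query, and the name groups is introduced here for illustration.

```wolfram
(* f /* g is RightComposition[f, g]: apply f first, then g. *)
(f /* g)[x]                    (* g[f[x]] *)
(Reverse /* First)[{1, 2, 3}]  (* 3 *)

(* The same chaining with operator forms: group pairs by their first element, then drop the group keys. *)
groups = (GroupBy[First] /* Values)[{{"a", 1}, {"b", 2}, {"a", 3}}]
(* {{{"a", 1}, {"a", 3}}, {{"b", 2}}} *)
```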

More than two datasets

The technique can be generalized to accommodate more than two datasets. Given the additional dataset:

d3 = { <| "Gene" -> "a", "g" -> 1 |>, <| "Gene" -> "b", "g" -> 2 |>} // Dataset;

Then:

{d1, d2, d3} //
Map[Query[GroupBy["Gene"] /* Values, Merge[Mean], averageKeys]] //
Fold[JoinAcross[#, #2, "Gene"]&] //
Sort

This approach is superior to the earlier expression on two counts: 1) it supports multiple datasets and 2) it is more robust when it comes to handling duplicate keys within the source datasets.
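The Fold step can be tried in isolation on plain lists of associations. The sketch below uses three small hypothetical tables (t1, t2, t3 and joined are names introduced here) and assumes a version in which JoinAcross behaves as documented.

```wolfram
t1 = {<|"Gene" -> "a", "x" -> 1|>, <|"Gene" -> "b", "x" -> 2|>};
t2 = {<|"Gene" -> "a", "y" -> 10|>, <|"Gene" -> "c", "y" -> 30|>};
t3 = {<|"Gene" -> "a", "z" -> 100|>, <|"Gene" -> "b", "z" -> 200|>};

(* Fold[f, {t1, t2, t3}] evaluates f[f[t1, t2], t3], so each table is
   inner-joined onto the running result. *)
joined = Fold[JoinAcross[#1, #2, "Gene"] &, {t1, t2, t3}];

(* Only gene "a" occurs in all three tables. *)
Length[joined]     (* 1 *)
KeySort /@ joined  (* {<|"Gene" -> "a", "x" -> 1, "y" -> 10, "z" -> 100|>} *)
```

Because the join is an inner join at every step, a gene missing from any one table drops out of the final result.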

edited Oct 28 '16 at 03:43

answered Oct 13 '16 at 04:25

WReach

The very first answer with JoinAcross that makes sense to me. I was wondering where this function may be useful. Up to now I considered it a retarded sister of GroupBy+Merge. :) +1 – Kuba Oct 13 '16 at 06:35
@WReach I have many questions about your answer. In no particular order: JoinAcross. My own answer, which appears to be an excessively verbose equivalent to yours, works with an arbitrary number of Dataset objects. JoinAcross requires that the first two arguments be separate lists. If you had a variable d={d1,d2,...}, how could you alter JoinAcross to handle that? I feel like this is a job for Fold, but I have never been able to get Fold to work as I wanted. Also, could you possibly elaborate on your use of composition /*? – SumNeuron Oct 21 '16 at 05:44
If the join criteria are identical for all datasets, we can use something like Fold[JoinAcross[##, "Gene"] &, {d1, d2, d3}]. Often it is the case that the join criteria are not identical or there are key collisions between the datasets. In such cases we need to explicitly nest the JoinAcross expressions. /* essentially chains operators together so that they are applied in order. In queries, the use can be subtle -- see (98193) for discussion. – WReach Oct 21 '16 at 14:56
@WReach unfortunately that doesn't seem to work for Dataset objects? Why are some methods unable to work with Dataset if it is just an association wrapped with a different head? – SumNeuron Oct 27 '16 at 06:14
@WReach Also what is with the double slot? – SumNeuron Oct 27 '16 at 09:08
@WReach So if I encapsulate the KeyValueMap in the operator form of Query as follows: data[All, KeyValueMap[...]/*Merge[Mean]], it works. Could you possibly break down that function though? I get the string replace patterns. What I do not understand is 1.) why you use a delayed assignment, and 2.) why you wrap the string replace in an association and then use a pure function. It stands to reason that #2 is the value associated with the given key, correct? But why this notation? Also, where can I read more about inner? – SumNeuron Oct 27 '16 at 09:40
I added a section demonstrating the use of fold across multiple datasets. The revised approach is more robust in the face of duplicate keys. ## references all arguments, in this case it is shorthand for #, #2. I don't use delayed assignment -- perhaps you mean :>? I use it instead of -> to ensure that n is a local variable. The notation <| ... |> & is simply defining a pure function that returns an association. The term "inner" here comes from relational joins. I'm happy to continue discussion, but perhaps it should be in chat. – WReach Oct 28 '16 at 03:39
Oh, and I should mention that JoinAcross is seriously broken in version 11.0.0 but works just fine in 11.0.1 and 10.x releases -- see the comments to (129122). – WReach Oct 28 '16 at 03:48

SumNeuron · Answer 2 · 2016-10-12T17:10:48.163

Applying mergeData results in:

Define the conditions in the data set:

bioKeys = Normal@Keys@First@Bio
conditions = {"c1", "c2"};

Acquire position of replicates via string search:

replicates = 
 Flatten@Position[bioKeys, #] & /@ 
    Flatten@StringCases[bioKeys, conditions[[#]] ~~ __] & /@ 
  Range@Length@conditions

Merge replicates:

mergedReplicates = 
 Bio[All, Flatten[bioKeys[[#]] & /@ replicates[[#]], 1] /* <|
      conditions[[#]] -> Mean|>] & /@ Range@Length@conditions

Delete individual replicates:

Bio = Bio[All, Delete[Partition[Flatten@replicates, 1]]];

Add in the merged replicate columns:

Table[Bio = 
   Dataset@MapThread[
     Append, {Normal@Bio, Thread[Normal@mergedReplicates[[i]]]}], {i, 
   Length@mergedReplicates}];

Confirm

Let's look at gene "a", which has both replicates and duplicates.

Getting gene "a" replicates for condition 1:

d1[All, {"Gene", "c1_1", "c1_2", "c1_3"}][Select[#Gene == "a" &]]

Averaging those rows together:

d1[All, {"Gene", "c1_1", "c1_2", "c1_3"}][
  Select[#Gene == "a" &]][Mean]

Average those columns together:

Mean[Normal[
   d1[All, {"Gene", "c1_1", "c1_2", "c1_3"}][Select[#Gene == "a" &]][
     Mean][Values]][[2 ;;]]]

and indeed that is the value for gene "a" in column "c1".

Main Functions

mergeData[Data_List] := Module[
  {keys, data, common, dupData},
  data = Data;
  keys = Normal@Keys@First@data[[#]] & /@ Range@Length@data;
  data = Table[
    With[{key = keys[[d]]}, data[[d]][SortBy[#[First@key] &]]], {d, 
     Length@data}];
  common = 
   Intersection[
    Table[With[{key = keys[[d]]}, data[[d]][All, #[First@key] &]], {d,
       Length@data}]];
  data = Table[
    With[{key = keys[[d]]}, 
     data[[d]][Select[MemberQ[common[[d]], #[First@key]] &]]], {d, 
     Length@data}];
  dupData = Table[DealWithDuplicates[data[[d]]], {d, Length@data}];
  data = Table[
    ReplaceDuplicatesWithMean[data[[d]], dupData[[d]][[1]], 
     dupData[[d]][[2]]], {d, Length@data}];
  data = DeleteDuplicatesBy[#, First] & /@ data;
  Return[JoinAcross @@ Append[data, First@First@keys]]
  ]

Supporting Functions

DealWithDuplicates[data_] := Module[
  {keys, dupicateValues, duplicatePositions, duplicatesAveraged},
  keys = Normal@Keys@First@data;
  dupicateValues = 
   If[Length[#] > 1, First@#, Nothing] & /@ 
    Split@Normal@data[All, #[First@keys] &];
  duplicatePositions = 
   Flatten[#] & /@ (Position[Normal@data[All, #[First@keys] &], #] & /@
       dupicateValues);
  duplicatesAveraged = 
   data[duplicatePositions[[#]]][Mean] & /@ 
    Range[Length@duplicatePositions];
  Return[{duplicatePositions, duplicatesAveraged}]
  ]

ReplaceDuplicatesWithMean[data_, duplicatePositions_, 
  duplicateAveraged_] := Module[{temp},
  temp = data;
  Table[temp = 
    ReplacePart[
     temp, {First@duplicatePositions[[i]]} -> 
      Normal@duplicateAveraged[[i]]], {i, Length@duplicateAveraged}];
  Return[temp];
  ]

DeleteDuplicatesNotAveraged[data_, duplicatePositions_, 
  duplicateAveraged_] := Module[
  {minus, temp},
  temp = data;
  temp = Delete[temp, 
    duplicatePositions[[#, 2 ;;]] & /@ 
     Range@Length@duplicatePositions];
  Return[temp]
  ]

This is a long way to go for a relatively straightforward processing task. First of all, if you find yourself using Table and Part, you're not using the features introduced in V10. – alancalvitti Oct 12 '16 at 17:06
@alancalvitti That is the purpose of this post. I can do the processing task, sure. Yet it is not optimal. I am struggling to understand the internal syntax for objects with the Head Dataset. Hence the shout-out to @OneSquare for their answer on Slots (as variable named slots don't work so great). – SumNeuron Oct 12 '16 at 17:13

Kuba · Answer 3 · 2016-10-12T12:12:53.230

I still don't know what exactly is needed but this does most of what you seem to be after:

joined = Join[d1, d2];
merged = Values @ GroupBy[joined, #d &, Merge[Mean]]

merged // MaximalBy[Length]

and this is what your function does:

mergeDatasets[d1, d2]

Except it has a strange value for (here) d=5: mine is an average, while yours seems to keep the first record (for other d it takes the average). Maybe that was your point; sorry, but I lost track.

alancalvitti · Answer 4 · 2016-10-12T18:07:21.350

2

No need for all that; try a functional approach. This gets you 90% of the way there; you may want to delete rows based on missing values to match your preferred format.

{d1 // GroupBy[Query@"Gene"], d2 // GroupBy[Query@"Gene"]} // 
    Merge[Merge[Mean]] // 
   Query[All, KeyDrop["Gene"] /* KeyMap[stringSplit["_"]] /* 
     keyGroupBy[First] /* Map[Values /* Mean]] // Dataset // Query[Transpose]

The Transpose is not necessary, but is a workaround to V11's highly compressed formatting. You can replace it with:

// keyPushValues["Gene"] where

keyPushValues[key_]:=Query[{Keys/*AssociationMap[Association[key->#1]&],Identity}]/*Query[Merge[Apply[Join]]]/*Values

to get:

Note uses:

keyGroupBy[f_][expr_]:=Association/@GroupBy[Normal[expr],Keys/*f]

And

stringSplit, the operator version:

stringSplit[str_String][expr_]:=StringSplit[expr,str]
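To see what the two helpers contribute to the pipeline, they can be applied to a single toy row. This is a sketch only: the definitions are repeated so that the block is self-contained, and row, split, grouped and averaged are names introduced here for illustration.

```wolfram
stringSplit[str_String][expr_] := StringSplit[expr, str]
keyGroupBy[f_][expr_] := Association /@ GroupBy[Normal[expr], Keys /* f]

row = <|"c1_1" -> 10, "c1_2" -> 40, "f1" -> 1|>;

(* Split each key at "_", turning it into a list of parts. *)
split = KeyMap[stringSplit["_"]][row]
(* <|{"c1", "1"} -> 10, {"c1", "2"} -> 40, {"f1"} -> 1|> *)

(* Group the entries by the first part of their key. *)
grouped = keyGroupBy[First][split]
(* <|"c1" -> <|{"c1", "1"} -> 10, {"c1", "2"} -> 40|>, "f1" -> <|{"f1"} -> 1|>|> *)

(* Average the values within each group. *)
averaged = Map[Values /* Mean][grouped]
(* <|"c1" -> 25, "f1" -> 1|> *)
```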

edited Oct 12 '16 at 18:07

answered Oct 12 '16 at 17:58

alancalvitti

I'm going to take a while to digest your code up there. Can you please (please) describe, verbosely, your code and how it works? – SumNeuron Oct 12 '16 at 17:59
@SumNeuron, a major advantage of functional approach is that it's a pipeline, that you can segment anywhere you want. Try each step in turn to see the effect, eg try {d1 // GroupBy[Query@"Gene"], d2 // GroupBy[Query@"Gene"]} // Merge[ Merge[Mean]] // Dataset -- it's unfortunate that Dataset has to be wrapped and Normalized so many times – alancalvitti Oct 12 '16 at 18:03
@SumNeuron, unfortunately, Stackexchange is not conducive to tutoring, but reach me at my email, I'll be happy to give you a tutorial. In fact I'd like to start a Khan Academy thing for data. – alancalvitti Oct 12 '16 at 18:06
sent the email :) Also will do. Why do you think that is? this kind of dataset manipulation is common, so shouldn't there be a pretty succinct way of doing this? – SumNeuron Oct 12 '16 at 18:16
Also this is throwing warnings all over :( I'm using v11 if that makes any difference. – SumNeuron Oct 12 '16 at 18:22

score 1 · Answer 5 · answered Oct 13 '16 at 22:54

This answer takes a three-step approach by querying the Datasets within Module.

Join and summarise the columns.
Get the sets of keys that need to be collapsed into one key.
Collapse the key sets from step 2 and drop the keys that form their basis.

mergeAndClean completes these steps.

mergeAndClean[ds1_Dataset, ds2_Dataset, idKey_] :=
 Module[{dsTemp, columnColapse},
  dsTemp =
   JoinAcross[ds1, ds2, idKey][
    GroupBy[#[idKey] &] /*
     Values,
    Transpose /*
     (Query[Join[{1 -> First}, (# -> Mean) & /@ Range[2, Length@Keys@#]]]@# &)];

  columnColapse = Normal@
    dsTemp[First /*
      Keys /*
      StringCases[LetterCharacter ~~ DigitCharacter ~~ "_" ~~ DigitCharacter] /*
      Apply[Join] /*
      GroupBy[StringTake[#, 2] &]];

  dsTemp[All,
   Function[{a},
     <|a, KeyValueMap[#1 -> Mean@a[[#2]] &]@columnColapse|>] /*
    KeyDrop[Flatten@Values@columnColapse]]
  ]

Then

mergeAndClean[d1, d2, "Gene"]

Additional parameters can be added to pass in the merging function, the key pattern, and so on.

Hope this helps.