2

Context

The decreasing price of high-throughput genome sequencing has skyrocketed its use.

Rise in RNA-seq publications

While Mathematica is a remarkable language, it does lag behind in some regards for various reasons; notably, Wolfram does not seem to have the same scale in community that other languages such as R has. Thus R users have seen the development of entire suites for processing genomic data, e.g. Bioconductor, Cufflinks, TopHat, Bowtie, Bowtie2, Samtools, STAR, BED tools, etc, etc.

While I will not attempt to recreate these tools in Mathematica, it may be of interest to the Bioinformaticians of the Wolfram community to have a collected source of useful questions, answers, functions, etc. that can be used following processing in the established pipelines.

It seems Mathematica's bioinformatics side is actually quite underdeveloped. Most of the features toted for Bioinformatics are applicable to most fields, and other than a hand full of functions like GenomeData[], I could only find one differentially expression package, here: Maayan Lab.


RLink for established packages

Sequence Processing

Gene List Pre-Processing

Differential Expression Analysis

J. M.'s missing motivation
  • 124,525
  • 11
  • 401
  • 574
SumNeuron
  • 5,422
  • 19
  • 51
  • 3
    This seems very broad -- can you narrow down the question? If the broad scope is intentional, then perhaps this would be better suited as a blog post? – WReach Oct 14 '16 at 15:06
  • 1
    @WReach I havent seen other blog posts before so possible. I have, however, seen other broad questions like this that serve as central hubs for directing to specific sub-answers. Mr. Wizard does this quite often :) – SumNeuron Oct 15 '16 at 06:49
  • 1
    The difference is that Mr Wizard normally focuses on the Mathematica context not the subject context. – Gordon Coale Oct 17 '16 at 07:21
  • @GordonCoale true. So what do you recommend? – SumNeuron Oct 17 '16 at 08:43

1 Answers1

3

Gene List Pre-Processing

ABOUT:

In this example we have two gene lists, both from mice. One gene list contains columns of features, the other has condition replicates.

GOAL:

The goal is to

  1. Convert the Ensembl IDs to gene names
  2. Replaces the Ensembl IDs in our Datasets with gene names
  3. Find those genes common to both Datasets and use only those records
  4. Find duplicate gene names and average those rows together
  5. Find replicates in our columns and merge those together
  6. Combine these datasets into a single Dataset for easier handling

NOTE:

This example uses the code described and demonstrated here:

The example data (because it is "clunky") is at the bottom of this post. An image of it is posted here for convenience.

example data


Get our genes

geneList = 
  Normal@With[{slot = First@Normal@Keys@First@geneListFile}, 
    geneListFile[All, Slot[slot] &]];
geneList2 = 
  Normal@With[{slot = First@Normal@Keys@First@geneListFile2}, 
    geneListFile2[All, Slot[slot] &]];

enter image description here


Set up for ID conversion

speciesCommonName = "mouse";
speciesCommonNameAssociations = <|"HUMAN" -> "hg", "MOUSE" -> "mm", 
   "YEAST" -> "sc", "RAT" -> "rn"|>;
speciesEncodedName = 
  Lookup[ToUpperCase@speciesCommonName]@
   speciesCommonNameAssociations;
referenceFileName = 
  First@FileNames[___ ~~ speciesEncodedName ~~ ___ ~~ ".csv", 
    NotebookDirectory[]];

Load reference file

referenceFile = SemanticImport@referenceFileName;
keys = Normal@Keys@First@referenceFile;
data = referenceFile[All, 
   Select[keys, StringContainsQ[#, "Gene"] &]];
\[ScriptCapitalA] = 
  AssociationThread[#[[;; , 1]], #[[;; , 2]]] &@Normal@data[Values];

Convert Ensembl IDs to gene names

genes = Lookup[\[ScriptCapitalA], #] & /@ geneList;
genes2 = Lookup[\[ScriptCapitalA], #] & /@ geneList2;

enter image description here

Replace Ensembl ID with gene name

d1 = Dataset@
   MapThread[
    Prepend,  { Normal@geneListFile, Thread["Genes" -> genes]}];
d2 = Dataset@
   MapThread[
    Prepend,  { Normal@geneListFile2, Thread["Genes" -> genes2]}];

enter image description here


Functions for merging data

DealWithDuplicates[data_] := 
 Module[{keys, dupicateValues, duplicatePositions, 
   duplicatesAveraged}, keys = Normal@Keys@First@data;
  dupicateValues = 
   If[Length[#] > 1, First@#, Nothing] & /@ 
    Split@Normal@data[All, #[First@keys] &];
  duplicatePositions = 
   Flatten[#] & /@ (Position[Normal@data[All, #[First@keys] &], #] & /@
       dupicateValues);
  duplicatesAveraged = 
   data[duplicatePositions[[#]]][Mean] & /@ 
    Range[Length@duplicatePositions];
  Return[{duplicatePositions, duplicatesAveraged}]]

ReplaceDuplicatesWithMean[data_, duplicatePositions_, duplicateAveraged_] := Module[{temp}, temp = data; Table[temp = ReplacePart[ temp, {First@duplicatePositions[[i]]} -> Normal@duplicateAveraged[[i]]], {i, Length@duplicateAveraged}]; Return[temp];]

DeleteDuplicatesNotAveraged[data_, duplicatePositions_, duplicateAveraged_] := Module[{minus, temp}, temp = data; temp = Delete[temp, duplicatePositions[[#, 2 ;;]] & /@ Range@Length@duplicatePositions]; Return[temp]]

mergeData[Data_List] := Module[{keys, data, common, dupData}, data = Data; keys = Normal@Keys@First@data[[#]] & /@ Range@Length@data; data = Table[ With[{key = keys[[d]]}, data[[d]][SortBy[#[First@key] &]]], {d, Length@data}]; common = Intersection[ Table[With[{key = keys[[d]]}, data[[d]][All, #[First@key] &]], {d, Length@data}]]; data = Table[ With[{key = keys[[d]]}, data[[d]][Select[MemberQ[common[[d]], #[First@key]] &]]], {d, Length@data}]; dupData = Table[DealWithDuplicates[data[[d]]], {d, Length@data}]; data = Table[ ReplaceDuplicatesWithMean[data[[d]], dupData[[d]][[1]], dupData[[d]][[2]]], {d, Length@data}]; data = DeleteDuplicatesBy[#, First] & /@ data; Return[JoinAcross @@ Append[data, First@First@keys]]]


Merge Data

Bio = mergeData[{d1, d2}]

enter image description here

bioKeys = Normal@Keys@First@Bio
conditions = {"condition1", "condition2"};
replicates = 
 Flatten@Position[bioKeys, #] & /@ 
    Flatten@StringCases[bioKeys, conditions[[#]] ~~ __] & /@ 
  Range@Length@conditions;

mergedReplicates = Bio[All, Flatten[bioKeys[[#]] & /@ replicates[[#]], 1] /* <| conditions[[#]] -> Mean|>] & /@ Range@Length@conditions;

Bio = Bio[All, Delete[Partition[Flatten@replicates, 1]]]; Table[Bio = Dataset@MapThread[ Append, {Normal@Bio, Thread[Normal@mergedReplicates[[i]]]}], {i,Length@mergedReplicates}];

enter image description here

Example Data

geneListFile=Dataset@{<|"Genes" -> "ENSMUSG00000102693", "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000064842", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000051951", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000102851", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000103377", 
  "Feature1" -> 60, "Feature2" -> 35, 
  "Feature3" -> 75|>, <|"Genes" -> "ENSMUSG00000104017", 
  "Feature1" -> 550, "Feature2" -> 360, 
  "Feature3" -> 560|>, <|"Genes" -> "ENSMUSG00000103025", 
  "Feature1" -> 7, "Feature2" -> 4, 
  "Feature3" -> 3|>, <|"Genes" -> "ENSMUSG00000089699", 
  "Feature1" -> 36, "Feature2" -> 34, 
  "Feature3" -> 49|>, <|"Genes" -> "ENSMUSG00000103201", 
  "Feature1" -> 144, "Feature2" -> 107, 
  "Feature3" -> 206|>, <|"Genes" -> "ENSMUSG00000103147", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000103161", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 1|>, <|"Genes" -> "ENSMUSG00000102331", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000102348", 
  "Feature1" -> 16, "Feature2" -> 10, 
  "Feature3" -> 15|>, <|"Genes" -> "ENSMUSG00000102592", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000088333", 
  "Feature1" -> 0, "Feature2" -> 1, 
  "Feature3" -> 1|>, <|"Genes" -> "ENSMUSG00000102343", 
  "Feature1" -> 2, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000102948", 
  "Feature1" -> 0, "Feature2" -> 0, 
  "Feature3" -> 0|>, <|"Genes" -> "ENSMUSG00000025900", 
  "Feature1" -> 28, "Feature2" -> 23, 
  "Feature3" -> 30|>, <|"Genes" -> "ENSMUSG00000104123", 
  "Feature1" -> 2, "Feature2" -> 3, "Feature3" -> 0|>}

geneListFile2=Dataset@{<|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000064842", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000102851", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 60, "condition1_2" -> 35, "condition1_3" -> 75, "condition2_1" -> 60, "condition2_2" -> 35|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 550, "condition1_2" -> 360, "condition1_3" -> 560, "condition2_1" -> 550, "condition2_2" -> 360|>, <|"Genes" -> "ENSMUSG00000103025", "condition1_1" -> 7, "condition1_2" -> 4, "condition1_3" -> 3, "condition2_1" -> 7, "condition2_2" -> 4|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 36, "condition1_2" -> 34, "condition1_3" -> 49, "condition2_1" -> 36, "condition2_2" -> 34|>, <|"Genes" -> "ENSMUSG00000103201", "condition1_1" -> 144, "condition1_2" -> 107, "condition1_3" -> 206, "condition2_1" -> 144, "condition2_2" -> 107|>, <|"Genes" -> "ENSMUSG00000103147", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000103161", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 1, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 16, "condition1_2" -> 10, "condition1_3" -> 15, "condition2_1" -> 16, "condition2_2" -> 10|>, <|"Genes" -> "ENSMUSG00000102592", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000100538", "condition1_1" -> 0, "condition1_2" -> 1, "condition1_3" -> 1, "condition2_1" -> 0, "condition2_2" -> 1|>, <|"Genes" -> "ENSMUSG00000101117", "condition1_1" -> 2, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 2, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000100204", "condition1_1" -> 0, "condition1_2" -> 0, "condition1_3" -> 0, "condition2_1" -> 0, "condition2_2" -> 0|>, <|"Genes" -> "ENSMUSG00000084668", "condition1_1" -> 28, "condition1_2" -> 23, "condition1_3" -> 30, "condition2_1" -> 28, "condition2_2" -> 23|>, <|"Genes" -> "ENSMUSG00000102693", "condition1_1" -> 2, "condition1_2" -> 3, "condition1_3" -> 0, "condition2_1" -> 2, "condition2_2" -> 3|>}

SumNeuron
  • 5,422
  • 19
  • 51