I know this question may seem strange.

I'm using Spearman correlation between gene expression profiles for various reasons (I won't go into details here). As a result, I often compare RNA-Seq and microarray samples. For preliminary analysis, I usually grab whatever version of the data is easily accessible (RPKM, FPKM, ...), but I'd like to dig a bit deeper.

Intuitively, I'd think a value such as RPKM would make more sense than raw counts, which is why I usually convert raw counts to RPKM. (I know RPKM is considered obsolete for statistical analysis of the data, but what I'm doing relies only on the correlation.)

So my question is: which RNA-Seq gene expression value would be closest to a microarray equivalent?

That is, what would (theoretically) maximize the correlation between the gene expression of the same sample profiled with RNA-Seq and with a microarray: RPKM, FPKM, TPM, or something else?

RoB
    “I know they're considered obsolete for statistical analysis of the data” — No, they’re considered obsolete for every purpose. Just don’t use them. Read https://bioinformatics.stackexchange.com/a/69/29. – Konrad Rudolph Apr 08 '20 at 17:54

2 Answers

I think it is very hard to say which is closest, because the two technologies are not really comparable. But since you are using Spearman correlation, the choice among RPKM, FPKM, and TPM should matter little: within a single sample they differ only by a constant scaling factor, so they do not change the rank order of gene expression levels. You might also want to normalize the RNA-seq and microarray data so that they are more comparable.
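To illustrate the rank argument, here is a minimal sketch in plain Python. The counts and gene lengths are made-up toy numbers, and the helper names (`rpkm`, `tpm`, `ranks`, `spearman`) are my own, not from any library. It shows that RPKM and TPM rank the genes of one sample identically, while raw counts rank them differently because they are not length-normalized.

```python
# Toy data (hypothetical): read counts and gene lengths in bp for one sample.
counts = [100, 400, 50, 900, 250]
lengths_bp = [500, 4000, 200, 3000, 10000]


def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print("RPKM vs TPM:", spearman(rpkm_values, tpm_values))
print("raw counts vs TPM:", spearman(counts, tpm_values))
```

Note that this only compares units within a single sample. Comparing one RNA-seq profile against one microarray profile gene by gene is exactly that situation, so under these assumptions the Spearman correlation would be the same whichever of the three units is used.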

Phoenix Mu

I did a comparison of cDNA count data against microarray data that was published a few years ago:

For comparisons to published data (Fig. S2; Miller et al., 2012), a generalized linear model was fitted to the relationship between log-transformed microarray and VSTPk expression levels obtained from the ImmGen Project database, and was used to transform the microarray data into values comparable to our VSTPks.

I found that the Variance-Stabilizing Transformation that was carried out by DESeq2 was close to what I wanted, but there still seemed to be a length-based bias to the reads. I corrected this by dividing by the length of the longest gene isoform in kilobases (creating something that I called VSTPk).
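As a rough sketch of the length correction described above (in Python, not the original code), assume the variance-stabilized values have already been exported from DESeq2 and that the length of the longest isoform of each gene is known in base pairs. The function name `vst_per_kb`, the gene names, and the numbers are all hypothetical.

```python
def vst_per_kb(vst_values, longest_isoform_bp):
    """Divide each gene's VST value by the length (in kb) of its longest isoform.

    vst_values: dict mapping gene -> variance-stabilized expression value
                (assumed to be precomputed, e.g. exported from DESeq2).
    longest_isoform_bp: dict mapping gene -> longest isoform length in bp.
    """
    result = {}
    for gene, value in vst_values.items():
        length_kb = longest_isoform_bp[gene] / 1000.0
        result[gene] = value / length_kb
    return result


# Hypothetical toy input, for illustration only.
vst = {"geneA": 10.0, "geneB": 7.5}
isoform_lengths = {"geneA": 2000, "geneB": 500}

vstpk = vst_per_kb(vst, isoform_lengths)
print(vstpk)
```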

After doing this, there were still range differences between the microarray and cDNA data, so I did an additional linear transformation to make the two datasets match as closely as possible.
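A minimal sketch of that kind of rescaling, in plain Python: it fits an ordinary least-squares line between log-transformed microarray intensities and the RNA-seq values for the same genes, then uses the line to map the microarray data onto the RNA-seq scale. The use of plain least squares, the log base 2, the function names, and the toy numbers are all assumptions for illustration, not the original analysis.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rescale_microarray(intensities, slope, intercept):
    """Map raw microarray intensities onto the RNA-seq scale via the fitted line."""
    return [slope * math.log2(v) + intercept for v in intensities]


# Hypothetical toy data: the same four genes measured on both platforms.
microarray = [2.0, 4.0, 8.0, 16.0]
rnaseq_vstpk = [3.0, 5.0, 7.0, 9.0]

log_microarray = [math.log2(v) for v in microarray]
slope, intercept = fit_line(log_microarray, rnaseq_vstpk)
rescaled = rescale_microarray(microarray, slope, intercept)
print(slope, intercept, rescaled)
```

A linear rescaling like this with a positive slope changes the range of the values but not their order, so it matters for direct value comparisons and not for a Spearman correlation.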

gringer