We have a few million 18S reads from a particular environment. The reads have been clustered into Operational Taxonomic Units (OTUs), and the OTUs annotated against a reference database.
To generate a rarefaction curve, my understanding is that one randomly samples $n$ reads, where $n$ ranges (with some step size) from a small depth up to the total number of reads, and counts the number of distinct OTUs observed at each such sub-sampling depth.
Which of the following two approaches, as implemented by sequence analysis suites such as QIIME and mothur, is standard practice? Which would be best to use in the situation above?
1. Treat the original assignments of reads to OTUs as truth, and when resampling $n$ reads, simply count the number of "original" OTUs observed in the sub-sample.
2. Re-cluster the sub-sampled reads, and then count the number of "new" OTUs in the sub-sample.
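For concreteness, here is a minimal sketch of the first approach as I understand it (all data and function names here are hypothetical, not taken from QIIME or mothur): each read keeps its original OTU label, reads are sub-sampled without replacement at increasing depths, and the distinct labels are counted at each depth.

```python
import random

def rarefaction_curve(read_otus, step, seed=0):
    """read_otus: list of OTU labels, one per read (original assignments).

    Returns a list of (depth, observed_otu_count) pairs, sub-sampling
    reads without replacement at each depth.
    """
    rng = random.Random(seed)
    curve = []
    for n in range(step, len(read_otus) + 1, step):
        subsample = rng.sample(read_otus, n)  # draw n reads without replacement
        curve.append((n, len(set(subsample))))
    return curve

# Toy example: 100 reads assigned to 3 OTUs with uneven abundances.
reads = ["otu_a"] * 50 + ["otu_b"] * 30 + ["otu_c"] * 20
curve = rarefaction_curve(reads, step=25)
print(curve)
```

In practice one would typically repeat the draw at each depth and average the counts to smooth the curve; at the full depth the sub-sample is the whole dataset, so the count always equals the total number of original OTUs.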
My sense from reading through the QIIME documentation is that method 1 is the standard, but I am not sure. I also do not quite understand why method 2 wouldn't be the better choice, aside from being computationally more expensive.