pigz | awk | wc is the fastest method
First off, for FASTQ benchmarks it's best to use a specific real-world file with a known answer. I've chosen this file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG01815/sequence_read/ERR047740_1.filt.fastq.gz
as my test file, the correct answers being:
Number of reads: 67051220
Number of bases in reads: 6034609800
Next we want to find the fastest way possible to count these. All timings are the average wall-clock time (real) of 10 runs, collected with the bash time builtin on an otherwise unloaded system:
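The exact harness isn't shown in this answer; here's a minimal sketch of one way to collect such averaged timings. The run count and the `sleep` stand-in command are illustrative only, not the actual benchmark:

```shell
#!/bin/sh
# Average wall-clock time of N runs of a command, in milliseconds.
# Substitute the pipeline under test for the sleep stand-in.
runs=5
total_ns=0
for i in $(seq "$runs"); do
  start=$(date +%s%N)          # GNU date: seconds + nanoseconds
  sleep 0.01                   # <- command under test goes here
  end=$(date +%s%N)
  total_ns=$((total_ns + end - start))
done
echo "average: $((total_ns / runs / 1000000)) ms"
```

For the real benchmark you would also want to drop the page cache (or do a warm-up run) between repetitions so each run sees the same I/O conditions.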
zgrep
zgrep . ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
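The awk logic can be sanity-checked on a hand-made FASTQ: each record is exactly four lines, and `NR%4==2` selects the sequence line of each record. The two-read input below is made up for the check:

```shell
# Two fake reads: ACGT (4 bases) and ACG (3 bases) => 2 reads, 7 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n' |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
# Number of reads: 2
# Number of bases in reads: 7
```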
This is the slowest method, with an average run-time of 125.35 seconds.
gzip awk
Switching from zgrep to plain gzip gains us about 10 seconds:
gzip -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is 116.69 seconds
Konrad's gzip awk wc variant
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
gzip -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
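Why the subtraction in `fix_base_count` works: `wc -cl` reports the line and byte counts (in that order, regardless of flag order), and each sequence line carries exactly one trailing newline, so bases = bytes − lines. A tiny self-contained check on made-up input:

```shell
# Same helper as in the answer, applied to two fake sequence lines.
fix_base_count() {
    local counts=($(cat))
    echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}

# ACGT + ACG: wc -cl sees 2 lines and 9 bytes, so 9 - 2 = 7 bases.
printf 'ACGT\nACG\n' | wc -cl | fix_base_count
# 2 7
```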
This runs slower on this test file than the plain gzip | awk variant; average run-time is 122.28 seconds.
kseq_test using latest kseq.h from klib
Code compiled with: gcc -O2 -o kseq_test kseq_test.c -lz where kseq_test.c is Simon's adaptation of Heng Li's FASTQ parser.
kseq_test ERR047740_1.filt.fastq.gz
Average run-time is 99.14 seconds, which is better than any of the gzip-based solutions so far, but we can do better!
pigz awk
Here we use Mark Adler's pigz as a drop-in replacement for gzip. pigz gives us a speed gain because, although decompression itself runs on a single main thread, pigz uses three additional threads for reading, writing and checksum calculation; see the man page for details.
pigz -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is now 93.86 seconds. This is ~5 seconds faster than the kseq based C code, but we can improve the benchmark further.
pigz awk wc
Next we use pigz as a drop-in replacement in Konrad's wc variant of the awk based solution.
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
pigz -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
Average run-time is now down to 83.03 seconds. This is ~16 seconds faster than the kseq based solution and ~42 seconds faster than the OP's zgrep based solution.
Next, as a baseline, let's see just how much of this run-time is due to decompression of the input fastq.gz file alone.
gzip alone
gzip -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 105.95 seconds. Since decompression alone already takes longer than kseq_test's total, the gzip based solutions (which includes zcat and zgrep, as both are provided by gzip) are never going to beat it.
pigz alone
pigz -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 77.66 seconds, so quite clearly the three additional threads for reading, writing and checksum calculation offer a useful advantage. What's more, the speed-up over gzip is even greater in the awk | wc pipeline; it's not clear why, but I suspect the extra write thread.
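Putting the numbers together, the decompress-only baseline lets us estimate the counting overhead each pigz pipeline adds on top of decompression. A quick back-of-the-envelope using the timings quoted in this answer:

```shell
# Pipeline time minus decompress-only time approximates the overhead
# contributed by the counting stages (awk, or awk|wc).
awk 'BEGIN {
  printf "pigz|awk    overhead: %.2f s\n", 93.86 - 77.66
  printf "pigz|awk|wc overhead: %.2f s\n", 83.03 - 77.66
}'
# pigz|awk    overhead: 16.20 s
# pigz|awk|wc overhead: 5.37 s
```

So the awk | wc split costs roughly a third of what a single awk doing both counts does, consistent with wc being much cheaper per byte than awk's length() accounting.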
Interestingly, average CPU usage across all threads is quite revealing for the various answers. I've collated these stats using GNU time (/usr/bin/time --verbose):
zgrep based solution: 133% - must be using more than one thread somehow
gzip | awk based solution: 99% - all gzip based solutions run single-threaded at 99% CPU usage
pigz | awk: 147%
gzip | awk | wc: 99%, as with gzip
pigz | awk | wc: 155%
kseq_test: 99%
gzip > /dev/null: 99%
pigz > /dev/null: 155%
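For reference, these percentages come from the "Percent of CPU this job got" line in the --verbose output, e.g. after running something like `/usr/bin/time --verbose pigz -dc file.fastq.gz > /dev/null 2> time.log`. A minimal way to extract the value; the sample log line and its 147% figure below are illustrative:

```shell
# Parse the CPU-usage line out of GNU time's --verbose report.
sample='Percent of CPU this job got: 147%'
echo "$sample" | awk -F': ' '{print $2}'
# 147%
```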
Whilst the main decompression thread in pigz runs at 100% CPU load, the extra three threads don't fully occupy additional cores (as evidenced by average CPU usage of ~150%); they do, however, clearly reduce run-time.
I'm using Ubuntu 16.04.2 LTS**; my gzip, zcat and zgrep are all from gzip 1.6, pigz is version 2.3.1, and gcc is version 5.4.0.
** I think my patch level is actually 16.04.4 but I've not rebooted for 170 days :p
Comments:

…zgrep . fulfils any tangible purpose. You should be able to leave it off entirely (replaced with zcat). – Konrad Rudolph Jun 28 '17 at 14:56

zgrep . will only print non-blank lines (granted, that will still count lines that have nothing but whitespace since those aren't technically empty, but it's better than nothing). – terdon Jun 28 '17 at 15:05

zcat wouldn't make all that much of a difference, I think. My main concern here is to find a more sophisticated tool that can deal with such issues robustly. – terdon Jun 28 '17 at 15:09

…pigz manual, it will compress in parallel but not decompress: "Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances." So maybe a little faster (testing it now) but I don't expect much difference for decompression. – terdon Jun 30 '17 at 11:39

unpigz is actually slower than zgrep . – terdon Jun 30 '17 at 12:41

In my testing gzip is over 10 seconds faster than zgrep BUT pigz is faster again ~30 seconds; it's also faster than the kseq using the latest klib.h. I'm not sure how you're getting that result, perhaps this only manifests on larger gzipped files. I've included the timings and test file in my answer. – Matt Bashton Jul 02 '17 at 21:56