pigz | awk | wc is the fastest method
First off, for FASTQ benchmarks it's best to use a specific real-world file with a known answer. I've chosen this file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG01815/sequence_read/ERR047740_1.filt.fastq.gz
as my test file, the correct answers being:
Number of reads: 67051220
Number of bases in reads: 6034609800
Next we want to find the fastest way possible to count these. All timings are the average wall-clock time (real) of 10 runs, collected with the bash time builtin on an otherwise unloaded system:
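The exact harness isn't shown in this answer; here's a minimal sketch of one way to collect such averaged timings. The run count and the `sleep` stand-in command are illustrative only, not the actual benchmark:

```shell
#!/bin/sh
# Average wall-clock time of N runs of a command, in milliseconds.
# Substitute the pipeline under test for the sleep stand-in.
runs=5
total_ns=0
for i in $(seq "$runs"); do
  start=$(date +%s%N)          # GNU date: seconds + nanoseconds
  sleep 0.01                   # <- command under test goes here
  end=$(date +%s%N)
  total_ns=$((total_ns + end - start))
done
echo "average: $((total_ns / runs / 1000000)) ms"
```

For the real benchmark you would also want to drop the page cache (or do a warm-up run) between repetitions so each run sees the same I/O conditions.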
zgrep
zgrep . ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
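The awk logic can be sanity-checked on a hand-made FASTQ: each record is exactly four lines, and `NR%4==2` selects the sequence line of each record. The two-read input below is made up for the check:

```shell
# Two fake reads: ACGT (4 bases) and ACG (3 bases) => 2 reads, 7 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n' |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
# Number of reads: 2
# Number of bases in reads: 7
```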
This is the slowest method, with an average run-time of 125.35 seconds.
gzip awk
Switching from zgrep to plain gzip gains us about 10 seconds:
gzip -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is 116.69 seconds
Konrad's gzip awk wc variant
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
gzip -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
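Why the subtraction in `fix_base_count` works: `wc -cl` reports the line and byte counts (in that order, regardless of flag order), and each sequence line carries exactly one trailing newline, so bases = bytes − lines. A tiny self-contained check on made-up input:

```shell
# Same helper as in the answer, applied to two fake sequence lines.
fix_base_count() {
    local counts=($(cat))
    echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}

# ACGT + ACG: wc -cl sees 2 lines and 9 bytes, so 9 - 2 = 7 bases.
printf 'ACGT\nACG\n' | wc -cl | fix_base_count
# 2 7
```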
This runs slower on this test file than the plain gzip | awk variant; average run-time is 122.28 seconds.
kseq_test using latest kseq.h from klib
Code compiled with: gcc -O2 -o kseq_test kseq_test.c -lz where kseq_test.c is Simon's adaptation of Heng Li's FASTQ parser.
kseq_test ERR047740_1.filt.fastq.gz
Average run-time is 99.14 seconds, which is better than any of the gzip-based solutions so far, but we can do better!
pigz awk
Here we use Mark Adler's pigz as a drop-in replacement for gzip. pigz gives us a speed gain because, although decompression itself runs on a single main thread, pigz uses three additional threads for reading, writing and checksum calculation; see the man page for details.
pigz -dc ERR047740_1.filt.fastq.gz |
awk 'NR%4==2{c++; l+=length($0)}
END{
print "Number of reads: "c;
print "Number of bases in reads: "l
}'
Average run-time is now 93.86 seconds. This is ~5 seconds faster than the kseq based C code, but we can improve the benchmark further.
pigz awk wc
Next we use pigz as a drop-in replacement in Konrad's wc variant of the awk based solution.
fix_base_count() {
local counts=($(cat))
echo "${counts[0]} $((${counts[1]} - ${counts[0]}))"
}
pigz -dc ERR047740_1.filt.fastq.gz \
| awk 'NR % 4 == 2' \
| wc -cl \
| fix_base_count
Average run-time is now down to 83.03 seconds. This is ~16 seconds faster than the kseq based solution and ~42 seconds faster than the OP's zgrep based solution.
Next, as a baseline, let's see just how much of this run-time is due to decompression of the input fastq.gz file alone.
gzip alone
gzip -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 105.95 seconds. Since decompression alone already takes longer than kseq_test's total, the gzip based solutions (which includes zcat and zgrep, as both are provided by gzip) are never going to beat it.
pigz alone
pigz -dc ERR047740_1.filt.fastq.gz > /dev/null
Average run-time: 77.66 seconds, so quite clearly the three additional threads for reading, writing and checksum calculation offer a useful advantage. What's more, the speed-up over gzip is even greater in the awk | wc pipeline; it's not clear why, but I suspect the extra write thread.
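Putting the numbers together, the decompress-only baseline lets us estimate the counting overhead each pigz pipeline adds on top of decompression. A quick back-of-the-envelope using the timings quoted in this answer:

```shell
# Pipeline time minus decompress-only time approximates the overhead
# contributed by the counting stages (awk, or awk|wc).
awk 'BEGIN {
  printf "pigz|awk    overhead: %.2f s\n", 93.86 - 77.66
  printf "pigz|awk|wc overhead: %.2f s\n", 83.03 - 77.66
}'
# pigz|awk    overhead: 16.20 s
# pigz|awk|wc overhead: 5.37 s
```

So the awk | wc split costs roughly a third of what a single awk doing both counts does, consistent with wc being much cheaper per byte than awk's length() accounting.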
Interestingly, average CPU usage across all threads is quite revealing for the various answers. I've collated these stats using GNU time (/usr/bin/time --verbose):
zgrep based solution: 133% - must be using more than one thread somehow
gzip | awk based solution: 99% - all gzip based solutions run single-threaded at 99% CPU usage
pigz | awk: 147%
gzip | awk | wc: 99%, as with gzip
pigz | awk | wc: 155%
kseq_test: 99%
gzip > /dev/null: 99%
pigz > /dev/null: 155%
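For reference, these percentages come from the "Percent of CPU this job got" line in the --verbose output, e.g. after running something like `/usr/bin/time --verbose pigz -dc file.fastq.gz > /dev/null 2> time.log`. A minimal way to extract the value; the sample log line and its 147% figure below are illustrative:

```shell
# Parse the CPU-usage line out of GNU time's --verbose report.
sample='Percent of CPU this job got: 147%'
echo "$sample" | awk -F': ' '{print $2}'
# 147%
```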
Whilst the main decompression thread in pigz runs at 100% CPU load, the extra three threads don't fully occupy additional cores (as evidenced by average CPU usage of ~150%); they do, however, clearly reduce run-time.
I'm using Ubuntu 16.04.2 LTS**; my gzip, zcat and zgrep are all from gzip 1.6, pigz is version 2.3.1, and gcc is version 5.4.0.
** I think my patch level is actually 16.04.4 but I've not rebooted for 170 days :p
Comments:

…zgrep . fulfils any tangible purpose. You should be able to leave it off entirely (replaced with zcat). – Konrad Rudolph Jun 28 '17 at 14:56

zgrep . will only print non-blank lines (granted, that will still count lines that have nothing but whitespace since those aren't technically empty, but it's better than nothing). – terdon Jun 28 '17 at 15:05

zcat wouldn't make all that much of a difference, I think. My main concern here is to find a more sophisticated tool that can deal with such issues robustly. – terdon Jun 28 '17 at 15:09

…pigz manual, it will compress in parallel but not decompress: "Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances." So maybe a little faster (testing it now) but I don't expect much difference for decompression. – terdon Jun 30 '17 at 11:39

unpigz is actually slower than zgrep . – terdon Jun 30 '17 at 12:41

In my testing gzip is over 10 seconds faster than zgrep BUT pigz is faster again ~30 seconds; it's also faster than the kseq using the latest klib.h. I'm not sure how you're getting that result, perhaps this only manifests on larger gzipped files. I've included the timings and test file in my answer. – Matt Bashton Jul 02 '17 at 21:56