When counting canonical kmers, i.e. when a kmer and its reverse complement are treated as the same sequence, how do kmer counting programs decide which of the two to use as the canonical one? Do they all work the same way?
To investigate, I built the string GAGTGCGGAATACCACTCTT, which contains all 16 possible 2mers, and counted it with KMC to see which kmer of each pair gets reported. Only the kmers in the filtered column below appeared. So it looks like KMC's 'canonical' kmer is whichever of the kmer and its reverse complement comes first alphabetically.
╔════════════════╦═════╦════════════════════╦══════════╗
║ Possible Kmers ║ RCs ║ RC occurs earlier? ║ filtered ║
╠════════════════╬═════╬════════════════════╬══════════╣
║ TT             ║ AA  ║ YES                ║ TA       ║
║ TG             ║ CA  ║ YES                ║ GC       ║
║ TC             ║ GA  ║ YES                ║ GA       ║
║ TA             ║ TA  ║                    ║ CG       ║
║ GT             ║ AC  ║ YES                ║ CC       ║
║ GG             ║ CC  ║ YES                ║ CA       ║
║ GC             ║ GC  ║                    ║ AT       ║
║ GA             ║ TC  ║                    ║ AG       ║
║ CT             ║ AG  ║ YES                ║ AC       ║
║ CG             ║ CG  ║                    ║ AA       ║
║ CC             ║ GG  ║                    ║          ║
║ CA             ║ TG  ║                    ║          ║
║ AT             ║ AT  ║                    ║          ║
║ AG             ║ CT  ║                    ║          ║
║ AC             ║ GT  ║                    ║          ║
║ AA             ║ TT  ║                    ║          ║
╚════════════════╩═════╩════════════════════╩══════════╝
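For reference, here is a minimal Python sketch of the rule I think is being applied: take the alphabetically smaller of a kmer and its reverse complement. This is only my guess from the output above, not KMC's actual implementation, and the function names (`reverse_complement`, `canonical`, `count_canonical_kmers`) are my own.

```python
from collections import Counter

# Translation table for complementing DNA bases (uppercase ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Assumed rule: the alphabetically smaller of the kmer and its RC."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(seq, k):
    """Count every k-long window of seq under its canonical form."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


counts = count_canonical_kmers("GAGTGCGGAATACCACTCTT", 2)
print(sorted(counts))
```

On my test string this gives the same ten 2mers as the filtered column, which is at least consistent with the alphabetical rule.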
Do all kmer counting programs pick the canonical kmer this way, and if so, is there documentation that says so? I wasn't able to find anything in the papers for Jellyfish or KMC.