14

Is there any resource (paper, blogpost, Github gist, etc.) describing the BWA-MEM algorithm for assigning mapping qualities? I vaguely remember that I have somewhere seen a formula for SE reads, which looked like

$C * (s_1 - s_2) / s_1,$

where $s_1$ and $s_2$ denoted the alignment scores of two best alignments and C was some constant.

I believe that a reimplementation of this algorithm in some scripting language could be very useful for the bioinfo community. For instance, I sometimes test various mapping methods and some of them tend to find good alignments, but fail in assigning appropriate qualities. Therefore, I would like to re-assign all the mapping qualities in a SAM file with the BWA-MEM algorithm.

Btw. This algorithm must already have been implemented outside BWA, see the BWA-MEM paper:

GEM does not compute mapping quality. Its mapping quality is estimated with a BWA-like algorithm with suboptimal alignments available.

Unfortunately, the BWA-MEM paper repo contains only the resulting .eval files.

Update: The question is not about the algorithm for computing alignment scores. Mapping qualities and alignment scores are two different things:

  • Alignment score quantifies the similarity between two sequences (e.g., a read and a reference sequence)
  • Mapping quality (MAQ) quantifies the probability that a read is aligned to a wrong position.

Even alignments with high scores can have a very low mapping quality.

terdon
  • 10,071
  • 5
  • 22
  • 48
Karel Břinda
  • 1,909
  • 9
  • 19
  • 2
    Unfortunately I don’t know the answer for BWA-MEM (since it differs from BWA!) but pretty much all other tools are described here: https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ – Konrad Rudolph May 31 '17 at 13:06
  • Maybe this page can help http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores . Once you know the best and alternative positions where a read can align (or even the best and second best only?) it's not too difficult to implement I guess. – dariober Jun 01 '17 at 13:44

2 Answers2

3

Yes, there bwa-mem was published as a preprint

BWA-MEM’s seed extension differs from the standard seed extension in two aspects. Firstly, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y.

...

Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence

And there is a description of the scoring algorithm directly in the source code of bwa-mem (lines 22 - 44), but maybe the only solution is really to go though the source code.

Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
  • Thank you for your answer. However, the question is more about assigning mapping qualities. Even reads with a very high alignment score can have a mapping quality equal to zero. – Karel Břinda May 31 '17 at 00:47
  • 1
    Have you checked the source code? line 22 - 44. – Kamil S Jaron May 31 '17 at 05:33
  • 5
    @KamilSJaron Wow, that’s terribly hard to understand. The actual code unfortunately isn’t better. :-( At any rate, could you update your answer to include this more prominently? – Konrad Rudolph May 31 '17 at 13:01
  • @ KamilSJaron I did and it is still not completely clear for me even in the easier case of single-end reads. – Karel Břinda May 31 '17 at 13:15
  • Well I had no intention to actually explain the score (since I really do not know and the question explicitly asked about resources). I just knew about the preprint and I also got the idea to look at the source code, where I found those 22 lines of maths that seemed to explain the scoring. – Kamil S Jaron May 31 '17 at 19:44
1

The MAPQ score appears to be calculated in the mem_approx_mapq_se function. I tried porting it to Python and inserting comments using ChatGPT 4, though I haven't tested it to ensure the output is identical to the original.

import math

Constants

MEM_MAPQ_COEFFICIENT = 30.0 MEM_MAPQ_MAXIMUM = 60

def calculate_mapping_quality(alignment_options, alignment_region): # Determine the suboptimal alignment score, which is the second-best SW alignment score if available, # otherwise calculated from minimum seed length and match score. suboptimal_alignment_score = (alignment_region.suboptimal_score if alignment_region.suboptimal_score else alignment_options.minimum_seed_length * alignment_options.match_score)

# Update the suboptimal alignment score to the higher of the two: current suboptimal score or tandem hit score.
suboptimal_alignment_score = (alignment_region.tandem_hit_score if alignment_region.tandem_hit_score > suboptimal_alignment_score 
                              else suboptimal_alignment_score)

# Early exit with zero mapping quality if the suboptimal score is greater than or equal to the alignment score.
if suboptimal_alignment_score >= alignment_region.best_alignment_score:
    return 0

# Calculate the maximum alignment length between the query and reference sequences.
alignment_length = max(alignment_region.query_end - alignment_region.query_start, 
                       alignment_region.reference_end - alignment_region.reference_start)

# Calculate the identity of the alignment.
identity = (1.0 - (alignment_length * alignment_options.match_score - alignment_region.best_alignment_score) 
            / (alignment_options.match_score + alignment_options.mismatch_penalty) / alignment_length)

# Compute the mapping quality based on the alignment score, identity, and other factors.
if alignment_region.best_alignment_score == 0:
    mapping_quality = 0
elif alignment_options.mapQ_coef_length > 0:
    # Adjust mapping quality based on alignment length and identity.
    length_factor = (1.0 if alignment_length < alignment_options.mapQ_coef_length 
                     else alignment_options.mapQ_coef_fac / math.log(alignment_length))
    length_factor *= identity * identity
    mapping_quality = int(6.02 * (alignment_region.best_alignment_score - suboptimal_alignment_score) 
                          / alignment_options.match_score * length_factor * length_factor + 0.499)
else:
    # Use the constant MEM_MAPQ_COEFFICIENT for mapping quality calculation.
    mapping_quality = int(MEM_MAPQ_COEFFICIENT * (1.0 - suboptimal_alignment_score / alignment_region.best_alignment_score) 
                          * math.log(alignment_region.seed_coverage) + 0.499)
    mapping_quality = int(mapping_quality * identity * identity + 0.499) if identity < 0.95 else mapping_quality

# Penalize the mapping quality for the presence of suboptimal hits.
if alignment_region.suboptimal_hit_count > 0:
    mapping_quality -= int(4.343 * math.log(alignment_region.suboptimal_hit_count + 1) + 0.499)

# Ensure the mapping quality does not exceed the maximum allowed value.
mapping_quality = min(max(mapping_quality, 0), MEM_MAPQ_MAXIMUM)

# Adjust the mapping quality based on the fraction of repetitive regions in the alignment.
mapping_quality = int(mapping_quality * (1.0 - alignment_region.repetitive_region_fraction) + 0.499)

# Return the final mapping quality value.
return mapping_quality

```

Autodactyle
  • 111
  • 1