How do I calculate the probability of finding two proteins that share a 5-amino-acid motif in a proteome of around 1067 proteins with an average length of 65 residues? The probability of one specific, completely identical sequence is 1067/20^65.

gringer

1 Answer

It's just significant, at 2%.

0.05^5 × 65 × 1067 ≈ 0.0217 (it's actually about 0.02; see the note)

Here the critical probability (significance threshold) is 0.05, i.e. 5%, so a value of about 0.02 falls just below it.

There are 20 amino acids, so the probability of any given residue occurring at random is 1/20 = 0.05. The oligopeptide is 5 amino acids long, hence the power of 5. There are 1067 proteins and the probability is per protein, so if these were independent it's a straight multiplication, because that is how independent events are combined.

Note: The 65 should really be 61, because it's a sliding window moving 1 amino acid at a time and the proteins are only 65 residues long. At the C-terminal end of the sequence the window cannot start at any of the last 4 positions, because once the oligopeptide is shorter than 5 residues the calculation is no longer valid. That leaves 65 - 5 + 1 = 61 windows per protein. I used 65 so you could see where the value originates.
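As a rough check of the arithmetic, here is a minimal Python sketch. The function names `motif_estimate` and `sliding_windows` are made up for illustration, and it assumes uniform, independent residues over a 20-letter alphabet and counts windows as length - k + 1.

```python
def sliding_windows(length, k):
    """Number of start positions for a k-residue window in a sequence."""
    return max(length - k + 1, 0)


def motif_estimate(n_proteins, windows_per_protein, k=5, alphabet=20):
    """Per-window match probability times the number of windows searched.

    Assumes every residue is drawn uniformly and independently, which real
    proteins do not satisfy.
    """
    p_window = (1 / alphabet) ** k  # chance that one window equals the motif
    return p_window * windows_per_protein * n_proteins


# Using the full length of 65 as the multiplier
naive = motif_estimate(1067, 65)

# Using the actual number of 5-residue windows in a 65-residue protein
corrected = motif_estimate(1067, sliding_windows(65, 5))

print(f"{naive:.4f}")      # 0.0217
print(f"{corrected:.4f}")  # 0.0203
```

Changing `alphabet` to 21 or 22 shows how much a larger amino acid alphabet would shift the estimate.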

Obviously the proteins in question are not independent so you need to state the assumption, i.e. assuming independence.
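To see what the independence assumption buys, the sketch below compares the straight multiplication with the probability of at least one match, 1 - (1 - p)^n, over all windows. It is an illustration only: it treats every window as an independent trial, which overlapping windows within one protein are not, and the variable names are mine.

```python
import math

p_window = (1 / 20) ** 5         # chance that one 5-residue window matches
n_windows = (65 - 5 + 1) * 1067  # windows per protein times number of proteins

# Straight multiplication (the expected number of matching windows)
linear = p_window * n_windows

# Probability of at least one match if all windows were independent,
# computed as 1 - (1 - p)^n in a numerically stable way
at_least_one = -math.expm1(n_windows * math.log1p(-p_window))

print(f"multiplication: {linear:.5f}")
print(f"at least one:   {at_least_one:.5f}")
```

Because the per-window probability is so small, the two numbers are close, so the multiplication is a reasonable approximation under the stated assumption.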

M__
  • There are 22 amino acids, not 20. Almost all animals (the only exceptions I know are a few insect species) have 21, because of selenocysteine, the 21st amino acid. I know less about pyrrolysine, the 22nd, but I believe it is found in some bacteria and archaea only. Given how few occurrences of this amino acid most species have, I doubt it will change the analysis much, but there are 21 amino acids in most non-plant eukaryotes. – terdon Jan 02 '24 at 10:33
  • Thanks @terdon, now I know. Selenocysteine is interesting because of the importance of cysteine disulphide bonds. I assume selenocysteine doesn't form them. Anyway, the probability of random occurrence is 1/21 or 1/22 depending. – M__ Jan 02 '24 at 14:37
  • I... don't remember. Wow. I did my PhD on these things but it's been a while and I always focused on detecting them rather than on their biochemistry, but still! Anyway, Sec has selenium where Cys has sulfur, so presumably it cannot make disulphide bonds by definition, but I do seem to recall that the selenium allows for very similar bonding. We tended to treat Sec as a more reactive version of Cys, but my ignorance of biochemistry is deep so I don't really know. – terdon Jan 02 '24 at 16:12