In my greedy search with pseudocounts algorithm in my bioinformatics course, I did not follow the pseudocode since I wanted to solve the problem in my own way. Unfortunately, although my answer is correct, the answer we are supposed to give is the first correct answer, not the last.
The problem works as such: Given t strings, you make a profile matrix from the first k-mer in the first string, and then you look at the next string and using that profile matrix, you find the best k-mer match, and then you combine that k-mer with the one from the first string to make a new profile matrix, etc. You eventually do this for all k-mer windows in your first string of DNA. The way I did it was by making a probability profile matrix from the best k-mer groups in each case, and then multiplying the greatest probabilities in each column. This gives me a score, and if the next score is higher (thus, more probable) that group of k-mers is used. Below is my code:
def highest_probability(string, k, matrix):
score = 1
temp_string = ""
c = 0
best_string = ""
temp = 0
while c+k < len(string):
for i in range(c, c+k):
if string[i] == 'A':
score *= matrix[i-c]['A']
temp_string += 'A'
elif string[i] == 'C':
score *= matrix[i-c]['C']
temp_string += 'C'
elif string[i] == 'G':
score *= matrix[i-c]['G']
temp_string += 'G'
elif string[i] == 'T':
score *= matrix[i-c]['T']
temp_string += 'T'
if i == c+k-1:
if score >= temp:
temp = score
best_string = temp_string
temp_string=""
score=1
c+=1
print(best_string)
return best_string
def GreedyMotifSearch(DNA, k, t):
best_motifs = [i[0:k] for i in DNA]
score=0
c=0
while c+k<len(DNA[0]):
k_mer = DNA[0][c:c+k]
motif=[k_mer]
matrix = make_matrix(k_mer, matrix=[], i=0)
for i in range(1, t):
compare_string = DNA[i]
good_motif = highest_probability(compare_string, k, matrix)
motif.append(good_motif)
matrix = make_matrix(good_motif, matrix, i)
temp_score=1
for dictionary in matrix:
temp_score *= max(dictionary.values())
print(temp_score)
if temp_score > score:
score = temp_score
best_motifs = motif
c+=1
return best_motifs
def make_matrix(string, matrix, i):
if len(matrix)==0:
for ch in string:
new_dict = {'A':0, 'C':0, 'G':0, 'T':0}
new_dict[ch] = 1
matrix += [new_dict]
print(matrix)
return matrix
elif len(matrix)==3:
for j in range(len(string)):
matrix[j][string[j]] += 1
Sum = [sum(matrix[0].values()),sum(matrix[1].values()),sum(matrix[2].values())]
for columns in range(len(matrix)):
for keys in matrix[columns]:
matrix[columns][keys] /= Sum[columns]
print(matrix)
return matrix
DNA=['GGCGTTCAGGCA',
'AAGAATCAGTCA',
'CAAGGAGTTCGC',
'CACGTCAATCAC',
'CAATAATATTCG']
k=3
t=5
print(GreedyMotifSearch(DNA, k, t))
The output gives me ['TTC', 'GTC', 'TTC', 'GTC', 'TTC'] which is equally correct to the answer given in the problem output which is TTC,ATC,TTC,ATC,TTC. Unfortunately, my answer is the last correct occurrence. I tried adjusting my highest_probability function with if score is greater than temp instead of greater or equal, but that gives me the answer ['GGC', '', '', '', ''] which I don't understand at all. My question is 1) Why is this the case? and 2) How can I fix this? And does anyone have a simpler way of doing this?