Greedy Motif Search Using Probability Matrices

Question

In my greedy search with pseudocounts algorithm in my bioinformatics course, I did not follow the pseudocode since I wanted to solve the problem in my own way. Unfortunately, although my answer is correct, the answer we are supposed to give is the first correct answer, not the last.

The problem works as such: Given t strings, you make a profile matrix from the first k-mer in the first string, and then you look at the next string and using that profile matrix, you find the best k-mer match, and then you combine that k-mer with the one from the first string to make a new profile matrix, etc. You eventually do this for all k-mer windows in your first string of DNA. The way I did it was by making a probability profile matrix from the best k-mer groups in each case, and then multiplying the greatest probabilities in each column. This gives me a score, and if the next score is higher (thus, more probable) that group of k-mers is used. Below is my code:

 def highest_probability(string, k, matrix):
    score = 1
    temp_string = ""
    c = 0
    best_string = ""
    temp = 0
    while c+k < len(string):
        for i in range(c, c+k):
            if string[i] == 'A':
                score *= matrix[i-c]['A']
                temp_string += 'A'
            elif string[i] == 'C':
                score *= matrix[i-c]['C']
                temp_string += 'C'
            elif string[i] == 'G':
                score *= matrix[i-c]['G']
                temp_string += 'G'
            elif string[i] == 'T':
                score *= matrix[i-c]['T']
                temp_string += 'T'
            if i == c+k-1:
                if score >= temp:
                    temp = score
                    best_string = temp_string
        temp_string=""
        score=1
        c+=1
    print(best_string)
    return best_string


def GreedyMotifSearch(DNA, k, t):
    best_motifs = [i[0:k] for i in DNA]
    score=0
    c=0
    while c+k<len(DNA[0]):
        k_mer = DNA[0][c:c+k]
        motif=[k_mer]
        matrix = make_matrix(k_mer, matrix=[], i=0)
        for i in range(1, t):
            compare_string = DNA[i]
            good_motif = highest_probability(compare_string, k, matrix)
            motif.append(good_motif)
            matrix = make_matrix(good_motif, matrix, i)
        temp_score=1
        for dictionary in matrix:
            temp_score *= max(dictionary.values())
        print(temp_score)
        if temp_score > score:
            score = temp_score
            best_motifs = motif
        c+=1
    return best_motifs



def make_matrix(string, matrix, i):
    if len(matrix)==0:
        for ch in string:
            new_dict = {'A':0, 'C':0, 'G':0, 'T':0}
            new_dict[ch] = 1
            matrix += [new_dict]
        print(matrix)
        return matrix
    elif len(matrix)==3:
        for j in range(len(string)):
            matrix[j][string[j]] += 1
        Sum = [sum(matrix[0].values()),sum(matrix[1].values()),sum(matrix[2].values())]
        for columns in range(len(matrix)):
            for keys in matrix[columns]:
                matrix[columns][keys] /= Sum[columns]
        print(matrix)
        return matrix 

DNA=['GGCGTTCAGGCA',
'AAGAATCAGTCA',
'CAAGGAGTTCGC',
'CACGTCAATCAC',
'CAATAATATTCG']
k=3
t=5
print(GreedyMotifSearch(DNA, k, t))

The output gives me ['TTC', 'GTC', 'TTC', 'GTC', 'TTC'] which is equally correct to the answer given in the problem output which is TTC,ATC,TTC,ATC,TTC. Unfortunately, my answer is the last correct occurrence. I tried adjusting my highest_probability function with if score is greater than temp instead of greater or equal, but that gives me the answer ['GGC', '', '', '', ''] which I don't understand at all. My question is 1) Why is this the case? and 2) How can I fix this? And does anyone have a simpler way of doing this?

score 1 · Answer 1 · answered Aug 27 '17 at 02:19

Assuming you're talking about Implement GreedyMotifSearch from Rosalind, while I haven't actually finished this particular one, I've got an idea why TTC,ATC,TTC,ATC,TTC doesn't work. Here's my explanation from the "Questions" section of that problem:

The profile is created from the first kmers in each string, and the most likely string from those is CAG. The answers will generally be similar to this.

I've been trying to do Rosalind by only using the information provided in the questions; this is one of the many problems where there wasn't enough information in the Question for me to arrive at a correct solution when I tried it. It's possible that the debug datasets were added to this question since I last looked at it, as they seem to describe this situation:

TEST DATASET 1:

Input:

3 4
GCCCAA
GGCCTG
AACCTA
TTCCTT

Output:

GCC
GCC
AAC
TTC

This dataset checks that your code always picks the first-occurring Profile-most Probable k-mer in a given sequence of Dna. In the first sequence (“GCCCAA”), “GCC” and “CCA” are both Profile-most Probable k-mers. However, you must return “GCC” since it occurs earlier than “CCA”. Thus, if the first sequence of your output is “CCA”, this test case fails your code.

Greedy Motif Search Using Probability Matrices

1 Answers1

TEST DATASET 1:

Input:

Output:

Linked