views: 426
answers: 9

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 of them and need to look for each sequence in the entire genome (of the Toxoplasma gondii parasite). I am not sure how large the genome is, but it is much larger than my 230,000 sequences.

I need to look for each of my 25-character sequences, for example AGCCTCCCATGATTGAACAGATCAT. The genome is formatted as one continuous string, i.e. CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT.........

I don't care where or how many times a sequence is found, just yes or no. This part is simple, I think: str.find("AGCCTCCCATGATTGAACAGATCAT").

But I also want to find close matches, defined as mismatched at any location, but only one location, and to record the location in the sequence. I am not sure how to do this. The only thing I can think of is using a wildcard and performing the search with the wildcard in each position, i.e. searching 25 times. For example, AGCCTCCCATGATTGAACAGATCAT vs AGCCTCCCATGATAGAACAGATCAT is a close match, with a mismatch at position 13.
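One way I can picture doing it without wildcards is to generate every single-substitution variant of a sequence (3 alternatives at each of the 25 positions, so 75 strings) and test each with find; just a sketch of the idea:

def one_mismatch_variants(seq, alphabet="ACGT"):
    # Yield (position, variant) for every single-substitution variant of seq.
    for i, c in enumerate(seq):
        for x in alphabet:
            if x != c:
                yield i, seq[:i] + x + seq[i+1:]

seq = "AGCCTCCCATGATTGAACAGATCAT"
# mismatch positions whose variant occurs somewhere in the genome string
hits = [i for i, v in one_mismatch_variants(seq) if genome.find(v) != -1]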

Speed is not a big issue; I am only doing this 3 times (I hope), but it would be nice if it was fast.

There are programs that find matches and partial matches, but I am looking for a type of partial match that is not available in those applications.

Here is a similar post for Perl, but they are only comparing sequences, not searching a continuous string:

Related post

A: 

You could use Python's built-in regular expression matching to do the search.

re module in Python: http://docs.python.org/library/re.html

Regular expression primer: http://www.regular-expressions.info/

Daniel
+3  A: 

You might find the various routines in Python-Levenshtein of some use.
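For example, assuming the python-Levenshtein package is installed, its hamming() function could drive a naive sliding-window search over the genome (a sketch, not a tuned solution):

import Levenshtein  # python-Levenshtein package, assumed installed

def search_with_one_mismatch(pattern, genome):
    # Slide a len(pattern) window along the genome and yield
    # (distance, offset) wherever the Hamming distance is 0 or 1.
    m = len(pattern)
    for i in xrange(len(genome) - m + 1):
        d = Levenshtein.hamming(pattern, genome[i:i+m])
        if d <= 1:
            yield d, i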

Ignacio Vazquez-Abrams
The closest answer to his question is the function which calculates the Hamming distance between two strings. He needs an approximate **search** function, not an approximate **match** function.
John Machin
+1  A: 

This hints at the longest common subsequence problem. The problem with string similarity here is that you are testing against a continuous string far longer than 25 characters, so if you compare one of your 25-character sequences to the whole continuous string you'll get a very low similarity.

If you compute the longest common subsequence between one of your 25-character sequences and the continuous string, you'll know the sequence is in the string when the LCS length equals the sequence length.
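For reference, here is a textbook dynamic-programming sketch of the LCS length computation (O(len(a)*len(b)) time, so practical only for short inputs):

def lcs_length(a, b):
    # Classic row-by-row DP for the length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j in xrange(1, len(b) + 1):
            if x == b[j-1]:
                cur.append(prev[j-1] + 1)
            else:
                cur.append(max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

A sequence s is then a (not necessarily contiguous) subsequence of the genome exactly when lcs_length(s, genome) == len(s).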

Pierre-Antoine LaFayette
+2  A: 
>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))

>>> seq_re.findall(genome)  # list of matches
[]  

>>> seq_re.search(genome) # None if not found, otherwise a match object

This one stops at the first match, so it may be a bit faster when there are multiple matches.

>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found

>>> print "found" if seq_re.search(genome) else "not found" 
not found

>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found

For a genome of length 10,000,000 you are looking at about 2.5 days for a single thread to scan 230,000 sequences, so you may want to split the task across a few cores/CPUs.

You can always start implementing a more efficient algorithm while this one is running :)
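Here is one rough sketch of splitting the scan across cores with multiprocessing; it assumes a fork-based platform so workers inherit the genome string, and "sequences" is a stand-in name for your list of 230,000 25-mers:

import re
import multiprocessing

def check_one(seq):
    # Build the 1-mismatch alternation for this sequence and scan the
    # inherited genome string; return (sequence, found?).
    seq_re = re.compile('|'.join(seq[:i] + '.' + seq[i+1:]
                                 for i in range(len(seq))))
    return seq, seq_re.search(genome) is not None

if __name__ == '__main__':
    pool = multiprocessing.Pool()             # defaults to one worker per core
    results = pool.map(check_one, sequences)  # 230,000 (seq, bool) pairs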

If you also wish to catch single dropped or added elements, change the regexp to this:

>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))
gnibbler
The genome is 80 megabases, so I will have lots of time to get a better algorithm. Is there a good/easy way to divide this up from within Python?
Vincent
@Vincent, Probably worth asking this as a separate question. Whichever algorithm you use, it will be worth breaking up the task. How many cores/cpus do you have at your disposal?
gnibbler
@Vincent see this question http://stackoverflow.com/questions/2359253/solving-embarassingly-parallel-problems-using-python-multiprocessing
gnibbler
I saw a solution using fnmatch.fnmatch(). Do you know how that speed might compare?
Vincent
+8  A: 

Before you read on, have you looked at biopython?

It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1.

If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:

any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome) - 24))

but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error, and (2) after a failure it advances the cursor by one rather than skipping ahead based on what it saw (as a Boyer-Moore search does).

You can overcome (1) with a function like this:

def Hamming_check_0_or_1(genome, posn, sequence):
    errors = 0
    for i in xrange(25):
        if genome[posn+i] != sequence[i]:
            errors += 1
            if errors >= 2:
                return errors
    return errors 

Note: that's intentionally not Pythonic, it's C-ish, because you'd need to use C (perhaps via Cython) to get reasonable speed.
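A hypothetical driver loop for it, just to show the intent (the real thing would be in C/Cython as noted):

sequence = "AGCCTCCCATGATTGAACAGATCAT"  # one of the 230,000 queries
for posn in xrange(len(genome) - len(sequence) + 1):
    if Hamming_check_0_or_1(genome, posn, sequence) <= 1:
        print "found (distance 0 or 1) at offset", posn
        break  # only a yes/no is needed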

Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.

When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.

Update: Working code for a bit-parallel method

I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.

The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.

Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.

Details are in this technical report. The algorithm there is for approximate search using Levenshtein distance. To convert it to Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of references to "R" with a "d" superscript; "d" is distance, and we need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.

# coding: ascii

from collections import defaultdict
import re

_DEBUG = 0


# "Fast Text Searching with Errors" by Sun Wu and Udi Manber
# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.
# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854

def WM_approx_Ham1_search(pattern, text):
    """Generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    S_table = defaultdict(int)
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    R0 = 0
    R1 = 0
    mask = 1 << (m - 1)
    for j, c in enumerate(text):
        S = S_table[c]
        shR0 = (R0 << 1) | 1
        R0 = shR0 & S
        R1 = ((R1 << 1) | 1) & S | shR0
        if _DEBUG:
            print "j= %2d msk=%s S=%s R0=%s R1=%s" \
                % tuple([j] + map(bitstr, [mask, S, R0, R1]))
        if R0 & mask: # exact match
            yield 0, j - m + 1
        elif R1 & mask: # match with one substitution
            yield 1, j - m + 1

if _DEBUG:

    def bitstr(num, mlen=8):
        wstr = ""
        for i in xrange(mlen):
            if num & 1:
                wstr = "1" + wstr
            else:
                wstr = "0" + wstr
            num >>= 1
        return wstr

def Ham_dist(s1, s2):
    """Calculate Hamming distance between 2 sequences."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def long_check(pattern, text):
    """Naively and understandably generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    for i in xrange(len(text) - m + 1):
        d = Ham_dist(pattern, text[i:i+m])
        if d < 2:
            yield d, i

def Paul_McGuire_regex(pattern, text):
    searchSeqREStr = (
        '('
        + pattern
        + ')|('
        + ')|('.join(
            pattern[:i]
            + "[ACTGN]".replace(c,'')
            + pattern[i+1:]
            for i,c in enumerate(pattern)
            )
        + ')'
        )
    searchSeqRE = re.compile(searchSeqREStr)
    for match in searchSeqRE.finditer(text):
        locn = match.start()
        dist = int(bool(match.lastindex - 1))
        yield dist, locn


if __name__ == "__main__":

    genome1 = "TTTACGTAAACTAAACTGTAA"
    #         01234567890123456789012345
    #                   1         2

    tests = [
        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),
        ("T" * 10, "TTTT"),
        ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match
        ]

    nfailed = 0
    for genome, patterns in tests:
        print "genome:", genome
        for pattern in patterns.split():
            print pattern
            a1 = list(WM_approx_Ham1_search(pattern, genome))
            a2 = list(long_check(pattern, genome))
            a3 = list(Paul_McGuire_regex(pattern, genome))
            print a1
            print a2
            print a3
            print a1 == a2, a2 == a3
            nfailed += (a1 != a2 or a2 != a3)
    print "***", nfailed
John Machin
This is worth a look at the very least to avoid that whole re-inventing the wheel thing!
jathanism
I posted this question on Biopython. They suggested just using BLAST and parsing the results with Python, but the results would include too many/too broad matches.
Vincent
+1  A: 

You could make a trie out of all the different sequences that you want to match. The tricky part is writing a depth-first search function down the trie that allows at most one mismatch.

The advantage of this method is that you are searching through all of the sequences at once, which saves you a lot of comparisons. For instance, when you start at the top node and go down the 'A' branch, you have just saved yourself many thousands of comparisons, because you have instantly matched against all sequences that start with 'A'. For a different argument, consider a slice of the genome that matches a given sequence exactly. If you have another sequence in your list that differs only in the last symbol, then using a trie has just saved you 23 comparison operations.

Here is one way of implementing this. Convert 'A','C','T','G' to 0,1,2,3, or some variant of that. Then use tuples of tuples as the structure for your trie. At each node, the first element corresponds to 'A', the second to 'C', and so on. If 'A' is a branch of this node, then there is another 4-element tuple as the first item of this node's tuple. If there isn't an 'A' branch, set the first item to 0. At the bottom of the trie are nodes that hold the id of the sequence, so that it can be put into the list of matches.
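For concreteness, a minimal sketch of building such a trie, using nested lists rather than tuples so branches can be filled in as sequences are added (the leaf slot holds the sequence id, offset by 1 so it tests truthy):

def buildtrie(sequences):
    # sequences: iterable of length-25 tuples of ints 0-3 ('A','C','T','G')
    root = [0, 0, 0, 0]
    for seqid, seq in enumerate(sequences):
        node = root
        for symbol in seq[:-1]:
            if not node[symbol]:
                node[symbol] = [0, 0, 0, 0]
            node = node[symbol]
        node[seq[-1]] = seqid + 1   # leaf: the sequence's id, offset by 1
    return root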

Here are recursive search functions allowing one mismatch for this sort of trie:

def searchnomismatch(node,genome,i):
    # genome is a slice of the genome, already encoded as 0-3 integers
    if i == 25:
        addtomatches(node)
    else:
        for x in range(4):
            if node[x]:
                if x == genome[i]:
                    searchnomismatch(node[x],genome,i+1)
                else:
                    searchmismatch(node[x],genome,i+1,i)

def searchmismatch(node,genome,i,where):
    if i == 25:
        addtomatches(node,where)
    else:
        if node[genome[i]]:
            searchmismatch(node[genome[i]],genome,i+1,where)

Here, I start out the search with something like

searchnomismatch(trie,genome[ind:ind+25],0)

and addtomatches is something similar to

def addtomatches(id, where=-1):
    matches.append((id, where))

where equal to -1 means there wasn't a mismatch. Anyway, I hope that was clear enough for you to get the picture.

Justin Peel
Funny, the first thing to occur to me was to make a trie out of all the length-25 substrings of the genome and match each candidate against it in the same way. That'd need way more memory. I think your approach could be optimized still further into a generalization of Aho-Corasick string matching.
Darius Bacon
@Darius Yes, I still think that a trie is a very good approach for this. I don't think that the memory would be too much, especially if a more memory efficient trie like a radix trie is used. I figured out that in the worst case 925 nodes (using a basic trie) will be visited to find all sequences matching a single slice of the genome. The average case is much less than this. That seems pretty fast to me. I'll have to look at the Aho-Corasick algorithm.
Justin Peel
Absolutely -- I meant *my* variant of the answer would have needed way more memory. I did upvote this; it's some silly StackOverflow bug that transferred my vote to some random other answer -- that's happened before. Let's hope it takes this time.
Darius Bacon
The idea with Aho-Corasick, by the way, is to examine the genome byte by byte without backing up, instead of examining it one 25-byte frame at a time. You compile a state machine ahead of time by walking your tries as if you were matching the genome, only without actually looking at the genome yet -- recording tests and states instead.
Darius Bacon
(That'd be equivalent to gnibbler's answer, except Python's built-in regexp implementation apparently doesn't create a state machine, but backtracks instead, so it loses on this problem.)
Darius Bacon
+1  A: 

I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158MB in size. I used the following pyparsing expression to extract the gene sequences; it took just under 2 minutes:

from pyparsing import (Word, nums, printables, Group, SkipTo,
                       LineEnd, Combine, OneOrMore, Token)

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read()   # yes! just read the whole dang 158MB!

"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                "length=" + integer("genelen") + LineEnd() + 
                Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))

# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)

(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)

Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

class CloseMatch(Token):
    def __init__(self, seq, maxMismatches=1):
        super(CloseMatch,self).__init__()
        self.name = seq
        self.sequence = seq
        self.maxMismatches = maxMismatches
        self.errmsg = "Expected " + self.sequence
        self.mayIndexError = False
        self.mayReturnEmpty = False

    def parseImpl( self, instring, loc, doActions=True ):
        start = loc
        instrlen = len(instring)
        maxloc = start + len(self.sequence)

        if maxloc <= instrlen:
            seq = self.sequence
            seqloc = 0
            mismatches = []
            throwException = False
            done = False
            while loc < maxloc and not done:
                if instring[loc] != seq[seqloc]:
                    mismatches.append(seqloc)
                    if len(mismatches) > self.maxMismatches:
                        throwException = True
                        done = True
                loc += 1
                seqloc += 1
        else:
            throwException = True

        if throwException:
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

        return loc, (instring[start:loc],mismatches)

For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)

I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for t,startLoc,endLoc in searchseq.scanString(g.gene):
        matched, mismatches = t[0]
        print "MATCH:", searchseq.sequence
        print "FOUND:", matched
        if mismatches:
            print "      ", ''.join(' ' if i not in mismatches else '*' 
                            for i,c in enumerate(searchseq.sequence))
        else:
            print "<exact match>"
        print "at location", startLoc
        print
    print

I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

scf_1104442825154 (964)
------------------------

scf_1104442822828 (942)
------------------------

scf_1104442824510 (987)
------------------------

scf_1104442823180 (1065)
------------------------
...

I was getting discouraged, not seeing any matches, until:

scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
                *      *        
at location 33

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 175

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 474

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 617

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
                       *   *    
at location 718

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
                    *  *        
at location 896

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
                       *     *  
at location 945

And finally my exact match at:

scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
                    *  *        
at location 177

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
                       *        
at location 203

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 350

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
                       *       *
at location 523

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 822

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
          *         *           
at location 969

So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.

For comparison, here is an RE-based version, that finds 1-mismatch matches only:

import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
             for i,c in enumerate(seqStr))

searchSeqRE = re.compile(searchSeqREStr)

for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for match in searchSeqRE.finditer(g.gene):
        print "MATCH:", seqStr
        print "FOUND:", match.group(0)
        print "at location", match.start()
        print
    print

(At first, I tried searching the raw FASTA file source itself, but was puzzled by how few matches there were compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the FASTA file output is wrapped at 60 characters.)
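For anyone who does want to search the raw text, here is a minimal sketch of un-wrapping the FASTA file into one continuous string per record (it keeps only the header line as the key and ignores the metadata the pyparsing code captures):

# Sketch: concatenate the wrapped sequence lines under each '>' header.
records = {}
current = None
for line in open(fname):
    line = line.strip()
    if line.startswith('>'):
        current = line[1:]
        records[current] = []
    elif current is not None:
        records[current].append(line)
for key in records:
    records[key] = ''.join(records[key])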

So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.

Paul McGuire
The N's stand for any of A, G, C or T, according to the FASTA format page on Wikipedia.
Justin Peel
Wow! You get the prize. I like the Hamming approach also, but if I am ever going to do more like this I would start with your code. Take a look at my answer to my own question. It is really amazing how fast nexalign is: it found every 1-mismatch from my file against the ME49 gondii fasta file in 30 sec.
Vincent
Don't forget, the raw fasta file lists the sequences in 60-character rows, but the actual gene sequence is the concatenation of many such rows. A desired match might straddle a newline.
Paul McGuire
A: 

I asked this question on the Biopython mailing list and got pointed to a great solution/program by @Michiel de Hoon: Nexalign can do exactly what you are trying to do. See http://genome.gsc.riken.jp/osc/english/dataresource/.

It is amazing that it can find all 1-mismatch hits for the 230,000 sequences I have in only 30 sec.

While this is the solution to my problem, to be fair it is not the answer to my Stack Overflow question, so I will mark one of the solutions here as the correct answer. Thanks for all the great input.

Vincent
@Vincent: glad you found a solution to your real problem, but you didn't address this point: """Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case."""
John Machin
@John Machin Thanks. I said this was not the answer to my question because the solution does not use Python. It is difficult to pick only one correct answer, especially when one ends up not using any of them. I have changed my selection to yours. If I take the advice of the others, it seems to be the way to go, and I should try it using the C library.
Vincent
A: 

I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want: http://bioinf.eva.mpg.de/patman/

cpp1