I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158Mb in size. I used the following pyparsing expression to extract the gene sequences, it took just under 2 minutes:
fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read() # yes! just read the whole dang 158Mb!
"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") +
"length=" + integer("genelen") + LineEnd() +
Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))
# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)
(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)
Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:
class CloseMatch(Token):
def __init__(self, seq, maxMismatches=1):
super(CloseMatch,self).__init__()
self.name = seq
self.sequence = seq
self.maxMismatches = maxMismatches
self.errmsg = "Expected " + self.sequence
self.mayIndexError = False
self.mayReturnEmpty = False
def parseImpl( self, instring, loc, doActions=True ):
start = loc
instrlen = len(instring)
maxloc = start + len(self.sequence)
if maxloc <= instrlen:
seq = self.sequence
seqloc = 0
mismatches = []
throwException = False
done = False
while loc < maxloc and not done:
if instring[loc] != seq[seqloc]:
mismatches.append(seqloc)
if len(mismatches) > self.maxMismatches:
throwException = True
done = True
loc += 1
seqloc += 1
else:
throwException = True
if throwException:
exc = self.myException
exc.loc = loc
exc.pstr = instring
raise exc
return loc, (instring[start:loc],mismatches)
For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)
I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):
searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
print "%s (%d)" % (g.id, g.genelen)
print "-"*24
for t,startLoc,endLoc in searchseq.scanString(g.gene):
matched, mismatches = t[0]
print "MATCH:", searchseq.sequence
print "FOUND:", matched
if mismatches:
print " ", ''.join(' ' if i not in mismatches else '*'
for i,c in enumerate(searchseq.sequence))
else:
print "<exact match>"
print "at location", startLoc
print
print
I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.
This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:
scf_1104442825154 (964)
------------------------
scf_1104442822828 (942)
------------------------
scf_1104442824510 (987)
------------------------
scf_1104442823180 (1065)
------------------------
...
I was getting discouraged, not to see any matches until:
scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
* *
at location 33
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 175
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 474
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 617
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
* *
at location 718
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
* *
at location 896
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
* *
at location 945
And finally my exact match at:
scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
* *
at location 177
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
*
at location 203
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
* *
at location 350
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
* *
at location 523
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
* *
at location 822
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
* *
at location 969
So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.
For comparison, here is an RE-based version, that finds 1-mismatch matches only:
import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
'|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:]
for i,c in enumerate(seqStr))
searchSeqRE = re.compile(searchSeqREStr)
for g in genedata:
print "%s (%d)" % (g.id, g.genelen)
print "-"*24
for match in searchSeqRE.finditer(g.gene):
print "MATCH:", seqStr
print "FOUND:", match.group(0)
print "at location", match.start()
print
print
(At first, I tried searching the raw FASTA file source itself, but was puzzled why so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the fasta file output is wrapped at n characters.)
So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.