views:

77

answers:

1

I have a file format (fastq format) that encodes a string of integers as a string where each integer is represented by an ascii code with an offset. Unfortunately, there are two encodings in common use, one with an offset of 33 and the other with an offset of 64. I typically have several 100 million strings of length 80-150 to convert from one offset to the other. The simplest code that I could come up with for doing this type of thing is:

def phred64ToStdqual(qualin):
    return(''.join([chr(ord(x)-31) for x in qualin]))

This works just fine, but it is not particularly fast. For 1 million strings, it takes about 4 seconds on my machine. If I change to using a couple of dicts to do the translation, I can get this down to about 2 seconds.

ctoi = {}
itoc = {}
for i in xrange(127):
    itoc[i]=chr(i)
    ctoi[chr(i)]=i

def phred64ToStdqual2(qualin):
    return(''.join([itoc[ctoi[x]-31] for x in qualin]))

If I blindly run under cython, I get it down to just under 1 second.
It seems like at the C-level, this is simply a cast to int, subtract, and then cast to char. I haven't written this up, but I'm guessing it is quite a bit faster. Any hints including how to better code a this in python or even a cython version to do this would be quite helpful.

Thanks,

Sean

+4  A: 

If you look at the code for urllib.quote, there is something that is similar to what you're doing. It looks like:

_map = {}
def phred64ToStdqual2(qualin):
    if not _map:
        for i in range(31, 127):
            _map[chr(i)] = chr(i - 31)
    return ''.join(map(_map.__getitem__, qualin))

Note that the above function works in case the mappings are not the same length (in urllib.quote, you have to take '%' -> '%25'.

But actually, since every translation is the same length, python has a function that does just this very quickly: maketrans and translate. You probably won't get much faster than:

import string
_trans = None
def phred64ToStdqual4(qualin):
    global _trans
    if not _trans:
        _trans = string.maketrans(''.join(chr(i) for i in range(31, 127)), ''.join(chr(i) for i in range(127 - 31)))
    return qualin.translate(_trans)
Mike Axiak
Thanks, Mike. That is a blazing fast 0.1 second on the same machine as above and will be fast enough for my purposes. I'll stick with phred64ToStdqual4() as listed above....
seandavi