views:

308

answers:

3

I have an interesting genetics problem that I would like to solve in native Python (nothing outside the standard library). This in order for the solution to be very easy to use on any computer, without requiring the user to install additional modules.

Here it is. I received 100,000s of DNA sequences (up to 2 billion) from a 454 new generation sequencing run. I want to trim the extremities in order to remove primers that may be present on both ends, both as normal and sense sequences. Example:

seq001: ACTGACGGATAGCTGACCTGATGATGGGTTGACCAGTGATC
        --primer-1---                 --primer-2-

Primers can be present one or multiple times (one right after the other). Normal sense are always on the left, and reverse on the right. My goal is thus to find the primers, cut the sequence such that only the primer-free part remains. For this, I want to use a classic alignment algorithm (ie: Smith-Waterman) that has been implemented in native Python (ie: not through biopython). I am aware of the fact that this may require quite some time (up to hours).

Note: This is NOT a direct "word" search, as DNA, both in the sequences and the primers, can be "mutated" for diverse technical reasons.

What would you use?

+1  A: 

Here's a paper on approximately that subject:

Rocke, On finding novel gapped motifs in DNA sequences, 1998.

Hopefully from that paper and its references, plus other papers which cite the above, you can find many ideas for algorithms. You won't find python code, but you may find descriptions of algorithms which you could then implement in Python.

Heath Hunnicutt
Thank you Heath. I'm really, however, looking for a Python implementation :) Cheers!
Morlock
+1  A: 

Researching that algorithm briefly, this is not easy stuff. This is going to take some very serious algorithm work. Try re-aligning your expectations from "hours" to "days or weeks".

The programmer implementing this will need:

  • High competence in general python programming
  • Algorithm programming experience, and a good understanding of time complexity.
  • A good understanding of python data structures such as dict, set, and deque, and their complexity characteristics.
  • Familiarity with unittests.

That programmer may or may not be you right now. This sounds like an awesome project, good luck!

Christian Oudard
@Christian Oudard The time I hinted at (hours) was referring to the time the algorithm might take, not how long it would take to create it :) From what I have found, I have rather decided to plunge deeper into the realm of using available (and quality) tools that already exist in the field of genetics. I give you the 'answer' since you finish nailing the nail I had half nailed myself while reflecting on the appropriatedness of reinventing the wheel here. Cheers!
Morlock
A: 

You could do this quite simply using regex? I don't think it would be that complicated! In fact, I have just completed some code to do something pretty much the same as this for one of the guys at the university here!

If not looking for exact copies of the primers, due to mutation then an element of fuzzy matching could be applied! The version I did very simply looked for exact primer matches at the start and end and returned the value minus those primers using the following code:

pattern = "^" + start_primer + "([A-Z]+)" + end_primer + "$" # start primer and end primer are sequences you are looking to match
regex = re.match(pattern, sequence) # sequence is the DNA sequence you are analyzing
print regex.group(1) # prints the sequence between the start and end primers

Here's a link on fuzzy regex in python http://hackerboss.com/approximate-regex-matching-in-python/

Steve