Hi,

I have sequences in FASTA format that contain 17 bp primers at the beginning of each sequence, and the primers sometimes have mismatches. I therefore want to remove the first 17 characters of each sequence, leaving the FASTA headers untouched.

The sequences look like this:

> name_name_number_etc
SEQUENCEFOLLOWSHERE
> name_number_etc
SEQUENCEFOLLOWSHERE
> name_name_number_etc
SEQUENCEFOLLOWSHERE

How can I do this in python?

Thanks! Jon

A: 

If your file looks like

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

and you want to remove the first 17 chars of every sequence line, you want to do something like this:

with open('sequence.txt') as f:
    for line in f:
        # Print every non-header line minus its first 17 characters.
        if not line.startswith('>'):
            print(line.strip()[17:])
The question says `except from the fasta header`, so the header lines have to be kept: re-arrange it with something like `line = line[17:]` inside the `if`, and move the print/output to a file outside the `if`.
RedGlyph
That removes the first 17 chars from every line which is not a header, not only from the start of the sequence.
Stefano Borini
@Stefano: it fits the sample given in the OP description, we are not all supposed to know the specifications of amino-acid sequencing formats ;-)
RedGlyph
@RedGlyph : yes but it does not fit the answer's own case
Stefano Borini
A: 
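import sys

with open('fasta_file') as f:
    for line in f:
        # Keep header lines intact; drop the first 17 characters of sequence lines.
        if not line.startswith('>'):
            line = line.rstrip('\n')[17:] + '\n'
        sys.stdout.write(line)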
Tendayi Mawushe
OK, I'm a Python noob. Why do I get this message when using this code? `with open(test_input.fas) as f` gives `SyntaxError: invalid syntax`
Jon
What version of Python are you using? The `with` statement was introduced in Python 2.5, where it has to be enabled by putting the line `from __future__ import with_statement` at the top of the module. In Python 2.6 it is enabled by default.
Tendayi Mawushe
Thanks. Found out I already had version 2.6 installed as well :)
Jon
No, it's because the first line is missing the trailing colon.
Stefano Borini
A: 

If I understand correctly, you need to remove only the first 17 characters (the primer) of a potentially multi-line sequence. That is a bit more difficult than what the other answers do. A simple solution exists, but it can fail in some situations, for example when the first line of a sequence is shorter than 17 characters.

My suggestion is: use Biopython to parse the FASTA file. Adapted from the tutorial:

from Bio import SeqIO

# Parse the FASTA file record by record and print some details of each one.
with open("ls_orchid.fasta") as handle:
    for seq_record in SeqIO.parse(handle, "fasta"):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))

Then write each sequence back out with the first 17 letters deleted. I don't have Biopython installed on my current machine, but if you take a look at the tutorial, it shouldn't take more than 15 lines of code in total.
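
For what it's worth, a minimal sketch of that write-back step could look like the following. It assumes a reasonably recent Biopython, in which `SeqIO.parse` and `SeqIO.write` accept file names and a `SeqRecord` can be sliced while keeping its id and description. The function name `trim_primers` and the file names are made up for illustration.

```python
from Bio import SeqIO

PRIMER_LENGTH = 17  # primer length stated in the question


def trim_primers(in_path, out_path, primer_length=PRIMER_LENGTH):
    """Copy a FASTA file, dropping the first `primer_length` letters of each record.

    Hypothetical helper: returns the number of records written.
    """
    # Slicing a SeqRecord yields a new record with the same id/description
    # and a shortened sequence, no matter how the input lines were wrapped.
    records = (record[primer_length:] for record in SeqIO.parse(in_path, "fasta"))
    return SeqIO.write(records, out_path, "fasta")
```

Calling `trim_primers("input.fasta", "trimmed.fasta")` would then write the trimmed records to a new file. Note that Biopython re-wraps the sequence lines on output, so the line breaks will not match the input.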

If you want to go hardcore and do it manually, you have to do something like this (modified from the first answer):

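import sys

with open('sequence.fsa') as f:
    first_line = False
    for line in f:
        if line.startswith('>'):
            # Header: write it unchanged; the next line starts the sequence.
            first_line = True
            sys.stdout.write(line)
        else:
            # Trim the primer only from the first sequence line of each record.
            sys.stdout.write(line[17:] if first_line else line)
            first_line = False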
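
The manual approach still assumes the whole primer sits on the first sequence line. If that is not guaranteed, one possible workaround is to join each record's sequence lines before trimming. This is only a sketch using the standard library: it assumes the file fits in memory, and the name `trim_fasta` and the 60-column wrap width are arbitrary choices for illustration.

```python
def trim_fasta(lines, primer_length=17, width=60):
    """Return output lines with the first `primer_length` letters of each record removed.

    Hypothetical helper: `lines` is any iterable of FASTA lines (e.g. an open file).
    """
    records = []  # list of [header, sequence] pairs
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            # Append to the current record, dropping any internal whitespace.
            records[-1][1] += "".join(line.split())

    out = []
    for header, seq in records:
        seq = seq[primer_length:]  # drop the primer from the joined sequence
        out.append(header)
        # Re-wrap the remaining sequence at `width` characters per line.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out
```

It could be used as `with open('sequence.fsa') as f: print("\n".join(trim_fasta(f)))`. The price of joining the lines is that the original line wrapping is lost.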
Stefano Borini
I like both the Biopython suggestion and the code proposal. Biopython will work even if the sequence spans several lines, contains whitespace, etc.
bgbg
Thanks! Works great!
Jon
It works great for input that matches the specification. In all other cases, it may fail.
Stefano Borini