Hi there. I have a FASTA file containing several protein sequences. The format is like

----------------------
>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ

>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

>protein3
MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY

-----------------------------------

Is there a good way to read in this file and store the sequences separately?

Thanks

+4  A: 

One way to do this:

  1. Create a vector where each location holds a name and the sequence
  2. Go through the file line by line

    • If the line starts with >, add a new element to the end of the vector and save line.substring(1) to it as the protein name. Initialize the element's sequence to "".
    • If line.length == 0, the line is blank, so do nothing.
    • Otherwise the line is part of the sequence, so append it to the current element: element.sequence += line. This way each line between >protein2 and >protein3 is concatenated and saved as the sequence of protein2.
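
The steps above could be sketched in C# roughly as follows. This is only a minimal sketch: FastaRecord and FastaParser are hypothetical names chosen for illustration, and ignoring any sequence lines that appear before the first > header is an assumption, not something the FASTA format prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical record type: one protein name plus its concatenated sequence.
public sealed class FastaRecord
{
    public string Name { get; }
    public StringBuilder Sequence { get; } = new StringBuilder();

    public FastaRecord(string name) { Name = name; }
}

public static class FastaParser
{
    // Reads FASTA text line by line and returns one record per ">" header.
    public static List<FastaRecord> Parse(TextReader reader)
    {
        var records = new List<FastaRecord>();
        string line;
        while ((line = reader.ReadLine()) != null)      // null means end of input
        {
            if (line.Length == 0)
                continue;                               // blank line: do nothing

            if (line[0] == '>')
                records.Add(new FastaRecord(line.Substring(1)));   // start a new protein
            else if (records.Count > 0)
                records[records.Count - 1].Sequence.Append(line);  // extend the current one
            // Assumption: sequence lines before the first header are ignored.
        }
        return records;
    }
}
```

To read from disk you could wrap a StreamReader in a using block and pass it to FastaParser.Parse; each record's Sequence.ToString() then gives the full sequence with the line breaks removed. A trailing * (as in protein2) is kept as-is, since whether to strip it depends on what you do with the sequence afterwards.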
Kyra
+2  A: 

I think a little more detail about the exact file structure would be helpful. Just looking at what you have (and a quick peek at the samples on Wikipedia) suggests that the name of the protein is prefixed with a > and followed by at least one line break, so that would be a good place to start.

You could split the file on newline, and look for a > character to determine the name.

From there it is a little less clear because I'm not sure if the sequence data is all in one line (no linebreaks) or if it could have linebreaks. If there are none, then you should be able to just store that sequence information, and move on to the next protein name. Something like this:

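using System.IO;

// StoreProteinName and StoreSequence are placeholders for your own storage logic.
using (var reader = new StreamReader(@"C:\myfile.fasta"))   // verbatim string, so \m is not treated as an escape
{
    string line;
    while ((line = reader.ReadLine()) != null)   // null means end of file
    {
        if (line.Length == 0)
            continue;                            // skip blank lines between records
        if (line.StartsWith(">"))
            StoreProteinName(line);
        else
            StoreSequence(line);
    }
}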

If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
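
As a rough illustration of that sample-driven approach, here is a small set of checks written without any test framework. Everything in it is hypothetical: FastaSampleChecks, Parse, and Expect are names made up for this sketch, and the compact Parse is only a stand-in so the checks have something to run against (it assumes > never appears inside a header or sequence line).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FastaSampleChecks
{
    // Stand-in parser: splits the text on '>' and joins each record's sequence lines.
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        foreach (var chunk in text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var lines = chunk.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            if (lines.Length == 0)
                continue;                                     // chunk held only line breaks
            result[lines[0]] = string.Concat(lines.Skip(1));  // first line is the name
        }
        return result;
    }

    // Throws with a readable message when a sample does not parse as expected.
    public static void Expect(bool condition, string description)
    {
        if (!condition)
            throw new InvalidOperationException("Failed: " + description);
    }

    // One check per format variation; add more samples as you find them.
    public static void RunAll()
    {
        var oneLine = Parse(">p1\nMKT\n");
        Expect(oneLine["p1"] == "MKT", "single-line sequence");

        var wrapped = Parse(">p1\nMKT\nLLL\n");
        Expect(wrapped["p1"] == "MKTLLL", "sequence wrapped over several lines");

        var blankLines = Parse(">p1\nMKT\n\n>p2\nTNW*\n");
        Expect(blankLines.Count == 2 && blankLines["p2"] == "TNW*",
               "blank line between records and a trailing *");

        var windows = Parse(">p1\r\nMKT\r\nLLL\r\n");
        Expect(windows["p1"] == "MKTLLL", "Windows line endings");
    }
}
```

Calling FastaSampleChecks.RunAll() throws on the first sample that fails. In a real project the same cases would probably live in a proper unit test framework instead.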

ckramer
Thank you guys. I appreciate your help
A: 

Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequence data in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.).

If you're interested in C or C++, check out the answers to this question over at Biostar: http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics

Do yourself a favor, and don't reinvent the wheel if you don't have to.

chrisamiller