ansaurus

Question

Best way to read a FASTA file in C#

Answer 1

+4 A:

One way to do this:

Create a vector in which each element holds a name and a sequence.
Go through the file line by line:
- If the line starts with >, add an element to the end of the vector and save line.substring(1) to that element as the protein name. Initialize the element's sequence to "".
- If line.length == 0, the line is blank, so do nothing.
- Otherwise the line is part of the sequence, so append it to the current element: element.sequence += line. This way, every line between >protein2 and >protein3 is concatenated and saved as the sequence of protein2.
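
The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not a definitive implementation: `FastaRecord` and `FastaReader.Parse` are hypothetical names introduced here for illustration, a `List<T>` stands in for the "vector", and any sequence lines that appear before the first > header are assumed to be safe to ignore.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical container for one FASTA entry (the name is illustrative).
public class FastaRecord
{
    public string Name = "";
    public StringBuilder Sequence = new StringBuilder();
}

public static class FastaReader
{
    // Reads records from any TextReader, e.g. a StreamReader over a file.
    public static List<FastaRecord> Parse(TextReader reader)
    {
        var records = new List<FastaRecord>();
        string line;
        // ReadLine returns null only at the end of the input.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith(">", StringComparison.Ordinal))
            {
                // New entry: the header text after '>' becomes the name.
                records.Add(new FastaRecord { Name = line.Substring(1) });
            }
            else if (line.Length == 0)
            {
                // Blank line: nothing to do.
            }
            else if (records.Count > 0)
            {
                // Sequence line: append it to the most recent entry.
                records[records.Count - 1].Sequence.Append(line);
            }
        }
        return records;
    }
}
```

To read a file, wrap a `StreamReader` in a `using` block and pass it to `Parse`. A `StringBuilder` is used instead of `+=` on a string because long sequences spread over many lines would otherwise be copied on every append.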

Kyra 2010-06-22 20:58:22

Answer 2

+2 A:

I think a little more detail about the exact file structure would be helpful. Just looking at what you have (and a quick peek at the samples on Wikipedia) suggests that the name of the protein is prefixed with a > and followed by at least one line break, so that would be a good place to start.

You could split the file on newline, and look for a > character to determine the name.

From there it is a little less clear, because I'm not sure whether the sequence data is all on one line or whether it can contain line breaks. If there are none, then you should be able to just store that sequence information and move on to the next protein name. Something like this:

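using System.IO;

// StoreProteinName and StoreSequence are placeholders for your own storage logic.
// The @ prefix makes this a verbatim string, so the backslash is not an escape.
using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at end of file; a blank line is skipped, not treated as the end.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue;
        if (line.StartsWith(">"))
            StoreProteinName(line);
        else
            StoreSequence(line);
    }
}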

If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
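
One way to approach that kind of sample-driven testing is a small set of inputs with expected results that grows as new format variations turn up. The sketch below is deliberately framework-free so that it stays self-contained; a test framework such as NUnit or xUnit would be the more usual choice. `ParseFasta` is a hypothetical stand-in for whatever parser is under test, and the sample strings are made up for illustration rather than real protein data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FastaSampleChecks
{
    // Hypothetical parser under test: maps each name to its joined sequence.
    // If a name repeats, the later entry replaces the earlier one.
    public static Dictionary<string, string> ParseFasta(string text)
    {
        var result = new Dictionary<string, string>();
        string current = null;
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(">", StringComparison.Ordinal))
                {
                    current = line.Substring(1);
                    result[current] = "";
                }
                else if (line.Length > 0 && current != null)
                {
                    result[current] += line;
                }
            }
        }
        return result;
    }

    // Throws if the parsed sequence for 'name' differs from 'expected'.
    public static void Check(string sample, string name, string expected)
    {
        var parsed = ParseFasta(sample);
        if (!parsed.ContainsKey(name) || parsed[name] != expected)
            throw new Exception("Sample failed for " + name);
    }

    public static void RunAll()
    {
        // Sequence on a single line.
        Check(">p1\nMKV\n", "p1", "MKV");
        // Sequence wrapped over several lines.
        Check(">p1\nMK\nVL\n", "p1", "MKVL");
        // Blank line between entries, with Windows line endings.
        Check(">p1\r\nMK\r\n\r\n>p2\r\nGA\r\n", "p2", "GA");
    }
}
```

Each new variation found in real files (comment lines, lowercase residues, trailing whitespace, and so on) becomes another `Check` call, written before the parser is changed to handle it.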

ckramer 2010-06-22 21:04:00

Thank you guys. I appreciate your help

2010-06-22 21:16:47

Answer 3

A:

Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequences in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.).

If you're interested in C or C++, check out the answers to this question over at Biostar: http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics

Do yourself a favor, and don't reinvent the wheel if you don't have to.

chrisamiller 2010-06-23 19:18:57
