Hi all,

I was curious to know whether there is any bioinformatics tool that can process a multi-FASTA file and report things like the number of sequences, their lengths, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anything.

Can you help me? Thanks a lot :-)

+4  A: 

EMBOSS is a collection of small tools, some of which can help you out.

To count number of fasta entries, I use: grep -c '^>' mySequences.fasta.

To make sure no entry header is duplicated, I check that I get the same number from this: grep '^>' mySequences.fasta | sort | uniq | wc -l
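If you would rather do the same two checks in a script, here is a minimal Python sketch of the idea. It is only an illustration: the function name `header_counts` and the tiny in-memory example are made up for this post, and in practice you would pass an open file handle for mySequences.fasta instead. Like the grep pipeline, it compares header lines only, not the sequences themselves.

```python
import io
from collections import Counter

def header_counts(handle):
    """Count how often each FASTA header line occurs in an open text handle."""
    return Counter(
        line.rstrip("\r\n")[1:]          # header text without the leading '>'
        for line in handle
        if line.startswith(">")
    )

# Hypothetical in-memory data standing in for mySequences.fasta.
example = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n>seq1\nTTAA\n")
counts = header_counts(example)

n_entries = sum(counts.values())   # what grep -c '^>' reports
n_unique = len(counts)             # what the sort | uniq | wc -l pipeline reports
duplicates = sorted(h for h, n in counts.items() if n > 1)

print(n_entries, n_unique, duplicates)
```

If `n_entries` and `n_unique` differ, `duplicates` tells you which headers occur more than once.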

Yannick Wurm
+1  A: 

You may also be interested in faSize, a tool from the Kent Source Tree, although it requires a bit more effort (you must download and compile it) than just using grep. Here is some example output:

me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

real    0m3.710s
user    0m3.541s
sys     0m0.164s
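
If compiling is not an option, a rough Python sketch can compute a few similar summary numbers (sequence count, total length, min/max/mean/median length, and character composition). To be clear, this is not how faSize works internally and it does not reproduce its output format. The names `read_fasta` and `summarize` and the example data are assumptions made for illustration.

```python
import io
import statistics
from collections import Counter

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)          # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def summarize(handle):
    """Return basic length and composition statistics for a FASTA handle."""
    lengths = []
    composition = Counter()
    for _header, seq in read_fasta(handle):
        lengths.append(len(seq))
        composition.update(seq.upper())  # count each residue, case-insensitively
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "sequences": len(lengths),
        "total": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "composition": dict(composition),
    }

# Hypothetical in-memory data; replace with open("myfile.fna") for a real file.
example = io.StringIO(">a\nACGT\nNN\n>b\nGGCC\n>c\nAT\n")
stats = summarize(example)
print(stats)
```

The list of lengths collected in `summarize` is also what you would hand to a plotting library if you want a length histogram.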
brant.faircloth