I have a text file that I want to turn into a list. I have asked two questions about this recently. The problem I keep running into is that I want to parse the text file, but its sections are of different lengths, so I cannot use

textscan(fid,'%s %s %s')

because the length of each gene entry varies. I have also had trouble using fields: the code I use to set up the fields only allows one line in each field. For the "note" field in the first gene below, I would like to be able to store multiple lines in one field and read them all in. Currently I am getting "Index exceeds matrix dimensions" errors.

fieldname = regexp(line{1},'/(.+)=','tokens','once');

value = regexp(line{1},'="?([^"]+)"?$','tokens','once');

Another way I could see this working is to use some sort of isLineEmpty check to split the genes at the empty line between them. Is there a way to have multiple lines in a field entry so I can get all the information associated with "note"? Or a way to use an isLineEmpty check and skip using fields?

 gene            218705..219367
                 /locus_tag="Rv0187"
                 /db_xref="GeneID:886779"
 CDS             218705..219367
                 /locus_tag="Rv0187"
                 /EC_number="2.1.1.-"
                 /function="THOUGHT TO BE INVOLVED IN TRANSFER OF METHYL
                 GROUP."
                 /note="Rv0187, (MTCI28.26), len: 220 aa. Probable
                 O-methyltransferase (EC 2.1.1.-), similar to many e.g.
                 AB93458.1|AL357591 putative O-methyltransferase from
                 Streptomyces coelicolor (223 aa); MDMC_STRMY|Q00719
                 O-methyltransferase from Streptomyces mycarofaciens (221
                 aa), FASTA scores: opt: 327, E(): 2.4e-17, (35.9% identity
                 in 192 aa overlap). Also similar to Rv1703c, Rv1220c from
                 Mycobacterium tuberculosis."
                 /codon_start=1
                 /transl_table=11
                 /product="O-methyltransferase"
                 /protein_id="NP_214701.1"
                 /db_xref="GI:15607328"
                 /db_xref="GeneID:886779"

 gene            219486..219917
                 /locus_tag="Rv0188"
                 /db_xref="GeneID:886776"
 CDS             219486..219917
                 /locus_tag="Rv0188"
                 /function="UNKNOWN"
                 /experiment="experimental evidence, no additional details
                 recorded"
                 /codon_start=1
                 /transl_table=11
                 /product="transmembrane protein"
                 /protein_id="NP_214702.1"
                 /db_xref="GI:15607329"
+2  A: 

I would probably consider using some sort of simple wrapper function to collapse the multi-line fields into a single line. Something like:

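function l = readlongline( fh )
% Read one logical line from fh, joining physical lines until every
% double quote has been closed. Returns -1 (like fgetl) at EOF.
quotesSeen = 0;
done       = false;
l          = '';
while ~done
    tline = fgetl( fh );
    if ~ischar( tline )
        % Hit EOF: return any text gathered so far, else pass on fgetl's -1
        if isempty( l )
            l = tline;
        end
        return
    end
    quotesSeen = quotesSeen + length( strfind( tline, '"' ) );
    % Stop once the quotes are balanced (an even count, including 0)
    done = mod( quotesSeen, 2 ) == 0;
    if isempty( l )
        l = tline;
    else
        % Continuation line: drop its indentation and join with one space
        l = [l, ' ', strtrim( tline )];
    end
end
end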

This is intended to be a replacement for fgetl.
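As a rough sketch of how the collapsed lines could then be fed to the regular expressions from the question, the script below applies the same quote-counting idea to a few qualifier lines held in memory (copied from the second gene in the question) and stores each qualifier in a struct. The variable names (lines, rec, buf, nq) are made up for illustration, and the anchored pattern '^/(\w+)=' for the qualifier name is an assumption of mine, not the asker's original pattern.

```matlab
% Sample qualifier lines taken from the question. In practice these would
% come from the file, e.g. one logical line at a time via readlongline.
lines = { ...
    '                 /locus_tag="Rv0188"', ...
    '                 /experiment="experimental evidence, no additional details', ...
    '                 recorded"', ...
    '                 /codon_start=1'};

rec = struct();   % hypothetical container for one feature's qualifiers
buf = '';         % logical line being assembled
nq  = 0;          % double quotes seen so far in buf

for k = 1:numel(lines)
    t = strtrim(lines{k});
    if isempty(buf)
        buf = t;
    else
        buf = [buf, ' ', t];   % join continuation lines with one space
    end
    nq = nq + numel(strfind(t, '"'));
    if mod(nq, 2) ~= 0
        continue               % a quoted value is still open; keep reading
    end

    % Qualifier name (assumed pattern) and value (pattern from the question)
    name  = regexp(buf, '^/(\w+)=', 'tokens', 'once');
    value = regexp(buf, '="?([^"]+)"?$', 'tokens', 'once');
    if ~isempty(name) && ~isempty(value)
        rec.(name{1}) = value{1};
    end

    buf = '';
    nq  = 0;
end

disp(rec)
```

Checking that both regexp results are non-empty before indexing them is one way to avoid an index error on lines that are not qualifiers, such as the "gene" and "CDS" header lines. Note that a repeated qualifier such as db_xref would overwrite the earlier value in this sketch, and any qualifier name that is not a valid struct field name would need to be mapped to one first.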

Edric
