I have a text file with the following format:

 gene            complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
 CDS             complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
                 /codon_start=1
                 /transl_table=11
                 /product="putative serine/threonine phosphatase Ppp"
                 /protein_id="ABQ71738.1"
                 /db_xref="GI:148503929"
 gene            complement(24628..25095)
                 /locus_tag="MRA_0021"
 CDS             complement(24628..25095)
                 /locus_tag="MRA_0021"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71739.1"
                 /db_xref="GI:148503930"
 gene            complement(25219..26802)
                 /locus_tag="MRA_0022"
 CDS             complement(25219..26802)
                 /locus_tag="MRA_0022"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71740.1"
                 /db_xref="GI:148503931"

I would like to read the text file into MATLAB and build a list in which each item starts at a gene line and holds the information that follows it. For this example the list would have 3 items. I have tried a few things but cannot get this to work. Does anyone have any ideas on what I can do?

+2  A: 

Here's a quick suggestion for an algorithm:

  1. Open the file with fopen.
  2. Read lines with fgetl until you find a line that starts with 'CDS' (ignoring leading whitespace).
  3. Keep reading lines until you reach the next line that starts with 'gene', or the end of the file.
  4. For each line between the lines found in (2) and (3):
    • find the string between '/' and '='. This is the field name.
    • find the string after '=', without the surrounding quotes if there are any. This is the value of the field.
  5. Increment the gene counter by one and repeat from step 2 until the whole file has been read.

These commands may be helpful:

  • To find a string enclosed by specific characters, use regexp with tokens, e.g. regexp(lineThatHasBeenRead,'/(\w+)=','tokens','once')
  • To create the output structure, use dynamic field names, e.g. output(ct).(fieldname) = value;
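
As a rough illustration of the two commands above, here is a minimal sketch that parses a single qualifier line copied from the example file. The variable names (sampleLine, nameTok, valueTok, entry) are made up for this illustration, and the patterns are one possible choice rather than the only way to write them.

```matlab
% Sample qualifier line copied from the example file
sampleLine = '                 /product="putative serine/threonine phosphatase Ppp"';

% Field name: the word characters between the first '/' and the '='
nameTok = regexp(sampleLine, '/(\w+)=', 'tokens', 'once');

% Value: everything after '=', with optional surrounding quotes dropped
valueTok = regexp(sampleLine, '="?([^"]+)"?\s*$', 'tokens', 'once');

% Dynamic field name: the field is chosen at run time from a string
entry = struct();
entry.(nameTok{1}) = valueTok{1};
disp(entry.product)
```

With 'tokens' and 'once', regexp returns a cell array holding the captured text, which is why the results are indexed with {1}. If a line has no match, the cell is empty, so it is worth checking isempty before indexing.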

EDIT

Here's some code. I saved your example as 'test.txt'.

% open file
fid = fopen('test.txt');

% parse the file
eof = false;
geneCt = 1;
clear output % you cannot reassign output if it exists with different fieldnames already
output(1:1000) = struct; % you may want to initialize fields here
while ~eof
    % read lines till we find one with CDS
    foundCDS = false;
    while ~foundCDS && ~eof
        currentLine = fgetl(fid);
        % check for eof, then CDS. Allow whitespace at the beginning
        if ~ischar(currentLine)
            % end of file (fgetl returns -1). The loop condition checks eof,
            % so a file that ends without a CDS line cannot loop forever
            eof = true;
        elseif ~isempty(regexp(currentLine,'^\s*CDS\s','match','once'))
            foundCDS = true;
        end
    end % looking for CDS

    if ~eof

        % read (and remember) lines till we find 'gene'
        collectedLines = cell(1,20); % assume about 20 lines per gene (the cell grows if needed). Row vector for looping below
        foundGene = false;
        lineCt = 1;
        while ~foundGene
            currentLine = fgetl(fid);
            % check for eof, then gene. Allow whitespace at the beginning
            if ~ischar(currentLine)
                % end of file (fgetl returns -1) - consider all data has been read
                eof = true;
                foundGene = true;
            elseif ~isempty(regexp(currentLine,'^\s*gene\s','match','once'))
                foundGene = true;
            else
                collectedLines{lineCt} = currentLine;
                lineCt = lineCt + 1;
            end
        end

        % loop through collectedLines and assign. Do not loop through the
        % gene line
        for lineCell = collectedLines(1:lineCt-1)
            fieldname = regexp(lineCell{1},'/(\w+)=','tokens','once');
            value = regexp(lineCell{1},'="?([^"]+)"?\s*$','tokens','once');
            if isempty(fieldname) || isempty(value)
                continue % blank line or no /name=value qualifier on this line
            end
            % try converting value to number
            numValue = str2double(value{1});
            if isfinite(numValue)
                value = numValue;
            else
                value = value{1};
            end
            output(geneCt).(fieldname{1}) = value;
        end
        geneCt = geneCt + 1;
    end
end % while eof

% cleanup
fclose(fid);
output(geneCt:end) = [];
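
Once the loop has finished, output is a struct array with one element per CDS block. Because the fields are added dynamically, a field that only some records have (such as gene) exists on every element and is simply empty where the record lacked it. Here is a small sketch of how the result might be queried. It uses a hand-built stand-in called parsed (a made-up name, filled with values from the example file) so that it runs on its own.

```matlab
% Hand-built stand-in for the parsed struct array (values from the example file)
parsed = struct('locus_tag', {'MRA_0020', 'MRA_0021', 'MRA_0022'}, ...
                'gene',      {'ppp', [], []});

% Collect one field across all records into a cell array
locusTags = {parsed.locus_tag};

% Records that never had a /gene qualifier hold [] in that field
hasGeneName = ~cellfun(@isempty, {parsed.gene});

fprintf('%d records, %d with a gene name\n', numel(parsed), nnz(hasGeneName));
```

The same indexing works on the real result, e.g. {output.locus_tag}.
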
Jonas