views: 65

answers: 2
I would like to read a (fairly big) log file into a MATLAB cell array of strings in one step. I have used the usual:

s = {};
fid = fopen('test.txt');
tline = fgetl(fid);
while ischar(tline)
   s = [s; {tline}];   %# append the line to the cell array
   tline = fgetl(fid);
end
fclose(fid);

but this is just slow. I have found that

fid = fopen('test.txt');
x = fread(fid,'*char');
fclose(fid);

is way faster, but I get an n-by-1 char matrix x. I could try to convert x to a cell array of strings, but then I get into char encoding hell; the line delimiter seems to be \n\r, or 10 and 13 in ASCII (I've looked at the end of the first line), but those two chars often don't follow each other and even show up solo sometimes.
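
What I mean by "convert" would be something like this (just a sketch of the idea, not benchmarked; regexp's 'split' option should handle \r\n as well as lone \n or \r):

fid = fopen('test.txt');
x = fread(fid,'*char')';                %# read the whole file as one 1-by-n char row
fclose(fid);
s = regexp(x,'\r\n|\n|\r','split')';    %# split into a cell array of strings, one line each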

So my question: is there an easy, fast way to read an ASCII file into a cell array of strings in one step, or to convert x into one?

Thank you.

edit:

reading via fgetl:

Code                           Calls        Total Time      % Time
tline = lower(fgetl(fid));     903113       14.907 s        61.2%

reading via fread:

>> tic;for i=1:length(files), fid = fopen(files(i).name);x=fread(fid,'*char*1');fclose(fid); end; toc
Elapsed time is 0.208614 seconds.

edit2:

I have tested preallocation; it does not help :(

files=dir('.');
tic
for i=1:length(files),   
    if files(i).isdir || isempty(strfind(files(i).name,'.log')), continue; end
    %# preassign s to some large cell array
    sizS = 50000;
    s=cell(sizS,1);

    lineCt = 1;
    fid = fopen(files(i).name);
    tline = fgetl(fid);
    while ischar(tline)
       s{lineCt} = tline;
       lineCt = lineCt + 1;
       %# grow s if necessary
       if lineCt > sizS
           s = [s;cell(sizS,1)];
           sizS = sizS + sizS;
       end
       tline = fgetl(fid);
    end
    fclose(fid);
    %# remove empty entries in s
    s(lineCt:end) = [];
end
toc

Elapsed time is 12.741492 seconds.

edit 3/solution:

roughly 10 times faster than the original:

s = textscan(fid,'%s','Delimiter','\n','whitespace','','bufsize',files(i).bytes);

I had to set 'whitespace' to '' in order to keep the leading spaces (which I need for parsing), and 'bufsize' to the size of the file (the default 4000 threw a buffer overflow error).
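
Put together, the full loop looks roughly like this (a sketch with the file handling added; the dir/.log filtering is the same as in edit2):

files=dir('.');
tic
for i=1:length(files)
    if files(i).isdir || isempty(strfind(files(i).name,'.log')), continue; end
    fid = fopen(files(i).name);
    %# read all lines at once, keeping leading whitespace
    c = textscan(fid,'%s','Delimiter','\n','whitespace','','bufsize',files(i).bytes);
    fclose(fid);
    s = c{1};   %# cell array of strings, one line per cell
    %# ... parse s here ...
end
toc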

A: 

Use the fgetl function instead of fread. For more info, go here

Raze2dust
I am using fgetl; however, it is slow.
stephan hattinger
+1  A: 

The main reason your first example is slow is that s grows in every iteration. Each growth means creating a new array, copying the old lines, and appending the new one, which adds considerable overhead.

To speed things up, you can preallocate s:

%# preassign s to some large cell array
s=cell(10000,1);
sizS = 10000;
lineCt = 1;
fid = fopen('test.txt');
tline = fgetl(fid);
while ischar(tline)
   s{lineCt} = tline;
   lineCt = lineCt + 1;
   %# grow s if necessary
   if lineCt > sizS
       s = [s;cell(10000,1)];
       sizS = sizS + 10000;
   end
   tline = fgetl(fid);
end
fclose(fid);
%# remove empty entries in s
s(lineCt:end) = [];

Here's a little example of what preallocation can do for you

>> tic,for i=1:100000,c{i}=i;end,toc
Elapsed time is 10.513190 seconds.

>> d = cell(100000,1);
>> tic,for i=1:100000,d{i}=i;end,toc
Elapsed time is 0.046177 seconds.
>> 

EDIT

As an alternative to fgetl, you could use TEXTSCAN

fid = fopen('test.txt');
s = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
s = s{1};

This reads the lines of test.txt as strings into the cell array s in one go.

Jonas
I was about to give the same answer but there's something that I don't understand: The content of each cell of the cell array is undefined. Does pre-allocation help in this case?
Amaç Herdağdelen
@Amac: Yes, it does. See my edit.
Jonas
Great, thanks. Just to be sure, I replicated it with strings of varying lengths, and still got a huge performance increase.
Amaç Herdağdelen
Thank you for your quick answer! I coded the example to show the general problem, but did not think of the fact that not preallocating in the example would slow things down. However, in my case I parse the lines right away, i.e. there is no string cell s. Profiling shows about 60% of the time is spent in the line "tline = fgetl(fid);" (with the rest of the code not being optimized for now).
stephan hattinger
@stephan hattinger: What kind of parsing do you do? Could you use textscan, or fscanf to do the parsing right away?
Jonas
It is rather complicated and context-sensitive parsing. Most of the lines don't even interest me; it's only a couple of roughly 200-line blocks per file. So what I do is: find the block entry token, read lines until the block end token, and pass the string cell to a recursive parsing routine (a block is generally an indented print of a very nested object (with arrays too)).
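Roughly, once the lines are in a cell array s, the block extraction is something like this (just a sketch; BLOCK_START, BLOCK_END and parseBlock stand in for the real tokens and routine, and it assumes one end token per start token):

startIdx = find(~cellfun('isempty', strfind(s,'BLOCK_START')));
endIdx   = find(~cellfun('isempty', strfind(s,'BLOCK_END')));
for k = 1:numel(startIdx)
    block = s(startIdx(k):endIdx(k));   %# lines of one block, including the tokens
    parseBlock(block);                  %# placeholder for the recursive parsing routine
end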
stephan hattinger
@stephan hattinger: You can use `textscan`. See my edit. I hope it's a bit faster than `fgetl`.
Jonas
Thank you very much, this solved my problem! Read time went down to 1/10th compared to fgetl.
stephan hattinger