We often replace undesirable characters in a file with another "good" character.

The interface is:

procedure cleanfileASCII2(vfilename: string; vgood: integer; voutfilename: string);

To replace all undesirable characters with a space we might call cleanfileASCII2('original.txt', 32, 'cleaned.txt').

The problem is that this takes a rather long time. Is there a better way to do it than the code shown below?

procedure cleanfileASCII2(vfilename: string; vgood: integer; voutfilename: string);
var
  F1, F2: file of char;
  Ch: Char;
  n: integer;
begin
  // original file
  AssignFile(F1, vfilename);
  Reset(F1);
  // output file
  AssignFile(F2, voutfilename);
  Rewrite(F2);
  while not Eof(F1) do
  begin
    Read(F1, Ch);
    n := ord(Ch);
    if ((n < 32) or (n > 127)) and (not (n in [10, 13])) then
    begin // bad char: replace only if a replacement character was given
      if vgood <> -1 then
      begin
        Ch := chr(vgood);
        Write(F2, Ch);
      end;
    end
    else // good char: copy through unchanged
      Write(F2, Ch);
  end;
  CloseFile(F2);
  CloseFile(F1);
end;
+1  A: 

You could buffer your input and output so you read a chunk of characters (even the whole file, if it's not too big) into an array, then process the array, then write the entire array to the output file.

In most of these cases, disk I/O is the bottleneck, and if you can do fewer large reads instead of many small ones, it will be faster.
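
For instance, a minimal sketch of the whole-file variant (the routine name and parameters here are hypothetical, and it assumes the file fits in memory and uses single-byte characters):

procedure CleanWholeFile(const InName, OutName: string; Good: Byte);
var
  InF, OutF: file;                     // untyped files, record size 1
  Buf: array of Byte;                  // buffer holding the entire file
  NumRead, NumWritten, i: Integer;
begin
  AssignFile(InF, InName);
  Reset(InF, 1);
  AssignFile(OutF, OutName);
  Rewrite(OutF, 1);
  try
    SetLength(Buf, FileSize(InF));     // one allocation for the whole file
    if Length(Buf) > 0 then
    begin
      BlockRead(InF, Buf[0], Length(Buf), NumRead);    // single large read
      for i := 0 to NumRead - 1 do                     // fix bad bytes in memory
        if ((Buf[i] < 32) or (Buf[i] > 127)) and not (Buf[i] in [10, 13]) then
          Buf[i] := Good;
      BlockWrite(OutF, Buf[0], NumRead, NumWritten);   // single large write
    end;
  finally
    CloseFile(OutF);
    CloseFile(InF);
  end;
end;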

brien
A: 

I did it this way, ensuring that the file I/O is done all in one go before the processing. The code could do with updating for Unicode, but it copes with nasty text characters such as nulls and gives you a TStrings capability. Bri

procedure TextStringToStringsAA( AStrings : TStrings; const AStr: Ansistring);
// A better routine than the stream 'SetTextStr'.
// Nulls (#0) which might be in the file e.g. from corruption in log files
// do not terminate the reading process.
var
  P, Start, VeryEnd: PansiChar;
  S: ansistring;
begin
  AStrings.BeginUpdate;
  try
    AStrings.Clear;

    P := Pansichar( AStr );
    VeryEnd := P + Length( AStr );

    if P <> nil then
      while P < VeryEnd do
      begin
        Start := P;
        while (P < VeryEnd) and not CharInSet(P^, [#10, #13]) do
         Inc(P);
        SetString(S, Start, P - Start);
        AStrings.Add(string(S));
        if P^ = #13 then Inc(P);
        if P^ = #10 then Inc(P);
      end;
  finally
    AStrings.EndUpdate;
  end;
end;


procedure TextStreamToStrings( AStream : TStream; AStrings : TStrings );
// An alternative to AStream.LoadFromStream
// Nulls (#0) which might be in the file e.g. from corruption in log files
// do not terminate the reading process.
var
  Size : Integer;
  S    : Ansistring;
begin
  AStrings.BeginUpdate;
  try
    // Make a big string with all of the text
    Size := AStream.Size - AStream.Position;
    SetString( S, nil, Size );
    AStream.Read(Pointer(S)^, Size);

    // Parse it
    TextStringToStringsAA( AStrings, S );
  finally
    AStrings.EndUpdate;
  end;
end;

procedure LoadStringsFromFile( AStrings : TStrings; const AFileName : string );
// Loads this strings from a text file
// Nulls (#0) which might be in the file e.g. from corruption in log files
// do not terminate the reading process.
var
  ST : TFileStream;
begin
  ST := TFileStream.Create( AFileName, fmOpenRead + fmShareDenyNone);
  // No attempt is made to prevent other applications from reading from or writing to the file.
  try
    ST.Position := 0;
    AStrings.BeginUpdate;
    try
      TextStreamToStrings( ST, AStrings );
    finally
      AStrings.EndUpdate;
    end;

  finally
    ST.Free;
  end;
end;
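
A hypothetical usage of the above for the original question, cleaning each line in memory after the single load (CleanAndSave and its parameters are illustrative and not part of the original routines; assumes Classes is in the uses clause):

procedure CleanAndSave(const InName, OutName: string; Good: Char);
var
  SL: TStringList;
  S: string;
  i, j: Integer;
begin
  SL := TStringList.Create;
  try
    LoadStringsFromFile(SL, InName);   // all the file reading happens here, in one go
    for i := 0 to SL.Count - 1 do
    begin
      S := SL[i];
      for j := 1 to Length(S) do       // replace undesirable characters in memory
        if not (S[j] in [#32..#127]) then
          S[j] := Good;
      SL[i] := S;
    end;
    SL.SaveToFile(OutName);            // one write at the end
  finally
    SL.Free;
  end;
end;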
Brian Frost
And if you replace the `not CharInSet` by `(P^ <> #10) and (P^ <> #13)` you will have a much faster loop. CharInSet is inlined but that doesn't change anything. It makes the compiler unable to generate optimal code.
Andreas Hausladen
Or 'not (P^ in [#10, #13])', which is also very much faster.
Ken White
A: 

Don't try to optimize without knowing where the time is actually being spent.

You should use the Sampling Profiler (delphitools.info) to find out where the bottleneck is. It's easy to use.

Precompute the Chr(vgood) conversion before the loop.

Also, you don't need the Ord() and Chr() conversions; always use the 'Ch' variable:

if not (ch in [#10, #13, #32..#127]) then
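
Applied to the original loop, those suggestions look roughly like this (GoodCh is a hypothetical local holding the precomputed replacement character):

if vgood <> -1 then
  GoodCh := Chr(vgood);            // precomputed once, outside the loop
while not Eof(F1) do
begin
  Read(F1, Ch);
  if not (Ch in [#10, #13, #32..#127]) then
  begin                            // bad char
    if vgood <> -1 then
      Write(F2, GoodCh);
  end
  else                             // good char
    Write(F2, Ch);
end;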
Junior-RO
If you follow your own advice you will probably find that precomputing the vGood won't make much of a difference (-:
Henk Holterman
Only if the vgood parameter was a char :o)
Junior-RO
+1  A: 

Buffering is the correct way to do that. I modified your code to see the difference:

procedure cleanfileASCII2(vfilename: string; vgood: integer; voutfilename: string);
var
  F1, F2: file;
  NumRead, NumWritten: Integer;
  Buf: array[1..2048] of Char;
  Ch: Char;
  i, n: integer;
begin
    AssignFile(F1, vfilename);
    Reset(F1, 1); // Record size = 1
    AssignFile(F2, voutfilename);
    Rewrite(F2, 1); // Record size = 1
    repeat
      BlockRead(F1, Buf, SizeOf(Buf), NumRead);
      for i := 1 to NumRead do
      begin
        Ch := Buf[i];
        //
        n := ord(ch);
        if ((n<32)or(n>127))and (not(n in [10,13])) then
        begin // bad char
         if vgood <> -1 then
         begin
           ch := chr(vgood);
           Buf[i] := Ch;
         end
        //else   //good char
         //Write(F2, Ch);
        end;
      end;
      BlockWrite(F2, Buf, NumRead, NumWritten);
    until (NumRead = 0) or (NumWritten <> NumRead);
    CloseFile(F1);
    CloseFile(F2);
end;
Nick D
+2  A: 

Several improvements:

  1. Buffer the data: read 2 KB, 16 KB, or similarly sized blocks
  2. Use a lookup table

Here's a stab at it, untested (no compiler in front of me right now):

procedure cleanfileASCII2(vfilename: string; vgood: integer; voutfilename: string);
var
    f1, f2: File;
    table: array[Char] of Char;
    index, inBuffer: Integer;
    buffer: array[0..2047] of Char;
    c: Char;
begin
    for c := #0 to #31 do
        table[c] := ' ';
    for c := #32 to #127 do
        table[c] := c;
    for c := #128 to #255 do
        table[c] := ' ';
    table[#10] := #10; // exception to spaces <32
    table[#13] := #13; // exception to spaces <32

    AssignFile(F1, vfilename);
    Reset(F1, 1);
    AssignFile(F2,voutfilename);
    Rewrite(F2, 1);
    while not Eof(F1) do
    begin
        BlockRead(f1, buffer, SizeOf(buffer), inBuffer);
        for index := 0 to inBuffer - 1 do
          buffer[index] := table[buffer[index]];
        BlockWrite(f2, buffer, inBuffer);
    end;
    Close(f2);
    Close(f1);
end;
Lasse V. Karlsen
+1 for the buffering but I don't expect the lookup to make any significant difference.
Henk Holterman
+1 Henk. Also, for Lasse: You can change your table initialization to three lines and a single loop (no formatting available in comments): FillChar(table, sizeof(table), #32); for c := #32 to #127 do table[c] := c; table[#10] := #10; table[#13] := #13;
Ken White
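Formatted, Ken White's suggested table initialization reads:

FillChar(table, SizeOf(table), #32);   // everything defaults to a space
for c := #32 to #127 do
  table[c] := c;                       // printable ASCII maps to itself
table[#10] := #10;                     // keep line feeds
table[#13] := #13;                     // keep carriage returns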
+5  A: 

The problem has to do with how you're treating the buffer. Memory transfers are the most expensive part of any operation. In this case, you're looking at the file byte by byte. By changing to a BlockRead or a buffered read, you will see an enormous increase in speed. Note that the correct buffer size varies based on where you are reading from. For a networked file, you may find that extremely large buffers are less efficient due to the packet size TCP/IP imposes. Even this has become a bit murky with large packets on gigabit Ethernet, but, as always, the best approach is to benchmark it.

I converted from standard reads to a file stream just for convenience; you could easily do the same thing with BlockRead. In this case, I took a 15 MB file and ran it through your routine. It took 131,478 ms to perform the operation on a local file. With the 1024-byte buffer, it took 258 ms.

procedure cleanfileASCII3(vfilename: string; vgood: integer; voutfilename: string);
const
  bufsize = 1023;
var
  inFS, outFS: TFileStream;
  buffer: array[0..bufsize] of byte;
  readSize: integer;
  i, n: integer;
begin
  if not FileExists(vFileName) then exit;

  inFS := TFileStream.Create(vFileName, fmOpenRead);
  inFS.Position := 0;
  outFS := TFileStream.Create(vOutFileName, fmCreate);
  while not (inFS.Position >= inFS.Size) do
  begin
    readSize := inFS.Read(buffer, sizeof(buffer));
    for i := 0 to readSize - 1 do
    begin
      n := buffer[i];
      if ((n < 32) or (n > 127)) and (not (n in [10, 13])) and (vgood <> -1) then
        buffer[i] := vgood;
    end;
    outFS.Write(buffer, readSize);
  end;
  inFS.Free;
  outFS.Free;
end;
Marshall Fryman