ansaurus

Question

Is There A Fast GetToken Routine For Delphi?

Answer 1

+7 A:

Delphi compiles to VERY efficient code; in my experience, it was very difficult to do better in assembler.

I think you should just point a PChar (they still exist, don't they? I parted ways with Delphi around 4.0) at the beginning of the string and increment it while counting "|"s, until you've found n-1 of them. I suspect that will be faster than calling PosEx repeatedly.

Take note of that position, then increment the pointer some more until you hit the next pipe. Pull out your substring. Done.

I'm only guessing, but I wouldn't be surprised if this was close to the quickest this problem can be solved.

EDIT: Here's what I had in mind. This code is, alas, uncompiled and untested, but it should demonstrate what I meant.

In particular, Delim is treated as a single char, which I believe makes a world of difference if that will fulfill the requirements, and the character at PLine is tested only once. Finally, there's no more comparison against TokenNum; I believe it's faster to decrement a counter to 0 for counting delimiters.

function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
var 
  Del: Char;
  PLine, PStart: PChar;
  Nth, I, P0, P9: Integer;
begin
  Del := Delim[1];
  Nth = TokenNum + 1;
  P0 := 1
  P9 := Line.length + 1;
  PLine := PChar(line);
  for I := 1 to P9 do begin
    if PLine^ = Del then begin
      if Nth = 0 then begin
        P9 := I;
        break;
      end;
      Dec(Nth);
      if Nth = 0 then P0 := I + 1
    end;
    Inc(PLine);
  end;
  if (Nth <= 1) or (TokenNum = 1) then
    Result := Substring(Line, P0, P9 - P0);
  else
    Result := '' 
end;

Carl Smotricz 2009-11-07 19:08:40

But will that work with unicode?

dummzeuch 2009-11-07 21:08:43

I'm about 80% sure it will. After all, PChar is meant to still act as a pointer to a character. I suspect the Inc operator moves it by the width of a Unicode char.

Carl Smotricz 2009-11-07 22:41:54

Yes, it works fine with Unicode, as my above implementation clearly shows. But until some explanation is given, it looks like it is MUCH slower.

lkessler 2009-11-08 08:42:01

I updated my answer. End result: it depends. If your delimiter is likely to be more than one character, your original routine isn't too bad, though it may be possible to make faster using Boyer-Moore. If it *is* a single character, then PChar scanning will beat it. BTW, if speed is very important and your data source is not UTF-16, then you will be faster with AnsiString and PAnsiChar as well.

Barry Kelly 2009-11-08 14:47:33

Oops, that comment was in the wrong box.

Barry Kelly 2009-11-08 15:03:16

Carl: Your code has a lot of typos and line.length should be length(line) and Substring should be copy. But I tested it out, changing the Delim param to Char, and it came out at about .26 seconds, which is right up there. But I like your concept of changing my loop of tokennumbers to a character loop which has potential for improving the time. Don't know how much more we can squeeze out, though.

lkessler 2009-11-08 19:29:15

Sorry bout those typos and errors. I was too lazy to install a Delphi in a virtual machine. I hope I was able to help a bit, if only in agreeing with everybody else that your delimiter needs to be a char if it's only 1 long.

Carl Smotricz 2009-11-08 19:36:48

I appreciate your help, Carl. I have one more update to my question, and I've added an implementation of your code which is about 15% faster.

lkessler 2009-11-08 20:38:49

.19 seconds... woo hoo! I'm happy to hear this and I really appreciate your feedback. And yes, I think TokenNum should be an integer; bytes save space but for a single one the CPU needlessly diddles and masks to work with them. If possible, you should count it down so you can compare with 0.

Carl Smotricz 2009-11-08 22:08:17

Answer 2

+2 A:

Using assembler would be a micro-optimization. There are much greater gains to be had by optimizing the algorithm. Not doing work beats doing work in the fastest possible way, every time.

One example would be if you have places in your program where you need several tokens of the same line. Another procedure that returns an array of tokens which you can then index into should be faster than calling your function more than once, especially if you let the procedure not return all tokens, but only as many as you need.

But in general I agree with Carl's answer (+1), using a PChar for scanning would probably be faster than your current code.

mghie 2009-11-07 19:33:25

Absolutely, first I optimize the algorithm. I hope I've done most of that over the past 10 years. Now its time to squeeze a bit more blood out of the rock. But it's funny how looking at the micro level does give you insights into the macro level. I'm making all sorts of improvements now, just because I'm thinking about it all again.

lkessler 2009-11-08 06:34:02

Answer 3

+1 A:

This is a function that I've had in my personal library for quite some time that I use extensively. I believe this is the most current version of it. I've had multiple versions in the past being optimized for a variety of different reasons. This one tries to take into account Quoted strings, but if that code is removed it makes the function a slight bit faster.

I actually have a number of other routines, CountSections and ParseSectionPOS being a couple of examples.

Unfortuately this routine is ansi/pchar based only. Although I don't think it would be difficult to move it to unicode. Maybe I've already done that...I'll have to check on that.

Note: This routine is 1 based in the ParseNum indexing.

function ParseSection(ParseLine: string; ParseNum: Integer; ParseSep: Char; QuotedStrChar:char = #0) : string;
var
   wStart, wEnd : integer;
   wIndex : integer;
   wLen : integer;
   wQuotedString : boolean;
begin
   result := '';
   wQuotedString := false;
   if not (ParseLine = '') then
   begin
      wIndex := 1;
      wStart := 1;
      wEnd := 1;
      wLen := Length(ParseLine);
      while wEnd <= wLen do
      begin
         if (QuotedStrChar <> #0) and (ParseLine[wEnd] = QuotedStrChar) then
            wQuotedString := not wQuotedString;

         if not wQuotedString and (ParseLine[wEnd] = ParseSep) then
         begin
            if wIndex=ParseNum then
               break
            else
            begin
               inc(wIndex);
               wStart := wEnd+1;
            end;
         end;
         inc(wEnd);
      end;

      result := copy(ParseLine, wStart, wEnd-wStart);
      if (length(result) > 0) and (QuotedStrChar <> #0) and (result[1] = QuotedStrChar) then
         result := AnsiDequotedStr(result, QuotedStrChar);
   end;
end; { ParseSection }

Ryan J. Mills 2009-11-08 01:08:53

Thanks for the code. You'll be happy to know it worked fine in Delphi 2009 with Unicode strings. Timing (using the QueryPerformanceCounter described above with the 1,000,000 calls) was .74 seconds with the QuotedStrChar code left in. I took out that code and tried it again and that reduced it to .56 seconds. That's still slower than my original Pos/PosEX code which took .29 seconds.

lkessler 2009-11-08 08:58:10

Answer 4

+1 A:

In your code, I think this is the only line that can be optimized:

Result := copy(Line, P+1, MaxInt)

If you calculate the new Length there, it might get a bit faster, but not the 10% you are looking for.

Your tokenizing algorithm seems pretty OK. For optimizing it, I would run it through a profiler (like AQTime from AutomatedQA) with a representative subset of your production data. That will point you to the weakest spot.

The only RTL function that comes close is this one in the Classes unit:

procedure TStrings.SetDelimitedText(const Value: string);

It tokenizes, but uses both QuoteChar and Delimiter, but you only use a Delimiter.

It uses the SetString function in the System unit which is a pretty fast way to set the content of a string based on a PChar/PAnsiChar/PUnicodeChar and a length.

That might get you some improvement as well; on the other hand, Copy is really fast too.

Jeroen Pluimers 2009-11-08 03:21:21

Looking at your first point, I think you're wrong about the MaxInt. To calculate the length there, it is: length(Line) - P, and that subtraction is more expensive than using the constant MaxInt. Delphi doesn't care if the length to copy goes past the end of the string. It knows to stop when the string is done. I have used the "MaxInt" trick for a long time, after it was recommended somewhere - I don't remember. It saves me 5 seconds every time I code it. :-)

lkessler 2009-11-08 06:30:32

The TStrings.SetDelimitedText function is designed to add strings to a stringlist, as opposed to picking out one specific token. But it uses a similar technique to the supposedly optimal PChar method I described above. I've used the SetString as well, which is very fast. AQTime reported that 1.7 million calls to SetString took .05 seconds.

lkessler 2009-11-08 09:03:15

@lkessler: Actually, SetDelimitedText is to replace the content of the stringlist. But you got my point: it uses a very similar technique, but based on PChar (as Carl and Bary suggested), so it is worth looking at.Good you verified the MaxInt thing: I indicated it might be improved, but you measured that MaxInt is the best way to do it.I now browsed all the new comments and your question edits, and it seems you got it solved. Great! I like the way this stackoverflow community thing works very much.

Jeroen Pluimers 2009-11-09 07:28:53

Answer 5

+6 A:

It makes a big difference what "Delim" is expected to be. If it's expected to be a single character, you're far better off stepping through the string character by character, ideally through a PChar, and testing specifically.

If it's a long string, Boyer-Moore and similar searches have a set-up phase for skip tables, and the best way would be to build the tables once, and reuse them for each subsequent find. That means you need state between calls, and this function would be better off as a method on an object instead.

You might be interested in this answer I gave to a question some time before, about the fastest way to parse a line in Delphi. (But I see that it is you that asked the question! Nevertheless, in solving your problem, I would hew to how I described parsing, not using PosEx like you are using, depending on what Delim normally looks like.)

UPDATE: OK, I spent about 40 minutes looking at this. If you know the delimiter is going to be a character, you're pretty much always better off with the second version (i.e. PChar scanning), but you have to pass Delim as a character. At the time of writing, you're converting the PLine^ expression - of type Char - to a string for comparison with Delim. That will be very slow; even indexing into the string, with Delim[1] will also be somewhat slow.

However, depending on how large your lines are, and how many delimited pieces you want to pull out, you may be better off with a resumable approach, rather than skipping unwanted delimited pieces inside the tokenizing routine. If you call GetTok with successively increasing indexes, like you are currently doing in your mini benchmark, you'll end up with O(n*n) performance, where n is the number of delimited sections. That can be turned into O(n) if you save the state of the scan and restore it for the next iteration, or pack all extracted items into an array.

Here's a version that does all tokenization once, and returns an array. It needs to tokenize twice though, in order to know how large to make the array. On the other hand, only the second tokenization needs to extract the strings:

// Do all tokenization up front.
function GetTok4(const Line: string; const Delim: Char): TArray<string>;
var
  cp, start: PChar;
  count: Integer;
begin
  // Count sections
  count := 1;
  cp := PChar(Line);
  start := cp;
  while True do
  begin
    if cp^ <> #0 then
    begin
      if cp^ <> Delim then
        Inc(cp)
      else
      begin
        Inc(cp);
        Inc(count);
      end;
    end
    else
    begin
      Inc(count);
      Break;
    end;
  end;

  SetLength(Result, count);
  cp := start;
  count := 0;

  while True do
  begin
    if cp^ <> #0 then
    begin
      if cp^ <> Delim then
        Inc(cp)
      else
      begin
        SetString(Result[count], start, cp - start);
        Inc(cp);
        Inc(count);
      end;
    end
    else
    begin
      SetString(Result[count], start, cp - start);
      Break;
    end;
  end;
end;

Here's the resumable approach. The loads and stores of the current position and delimiter character do have a cost, though:

type
  TTokenizer = record
  private
    FSource: string;
    FCurrPos: PChar;
    FDelim: Char;
  public
    procedure Reset(const ASource: string; ADelim: Char); inline;
    function GetToken(out AResult: string): Boolean; inline;
  end;

procedure TTokenizer.Reset(const ASource: string; ADelim: Char);
begin
  FSource := ASource; // keep reference alive
  FCurrPos := PChar(FSource);
  FDelim := ADelim;
end;

function TTokenizer.GetToken(out AResult: string): Boolean;
var
  cp, start: PChar;
  delim: Char;
begin
  // copy members to locals for better optimization
  cp := FCurrPos;
  delim := FDelim;

  if cp^ = #0 then
  begin
    AResult := '';
    Exit(False);
  end;

  start := cp;
  while (cp^ <> #0) and (cp^ <> Delim) do
    Inc(cp);

  SetString(AResult, start, cp - start);
  if cp^ = Delim then
    Inc(cp);
  FCurrPos := cp;
  Result := True;
end;

Here's the full program I used for benchmarking.

Here are the results:

*** count=3, Length(src)=200
GetTok1: 595 ms
GetTok2: 547 ms
GetTok3: 2366 ms
GetTok4: 407 ms
GetTokBK: 226 ms
*** count=6, Length(src)=350
GetTok1: 1587 ms
GetTok2: 1502 ms
GetTok3: 6890 ms
GetTok4: 679 ms
GetTokBK: 334 ms
*** count=9, Length(src)=500
GetTok1: 3055 ms
GetTok2: 2912 ms
GetTok3: 13766 ms
GetTok4: 947 ms
GetTokBK: 446 ms
*** count=12, Length(src)=650
GetTok1: 4997 ms
GetTok2: 4803 ms
GetTok3: 23021 ms
GetTok4: 1213 ms
GetTokBK: 543 ms
*** count=15, Length(src)=800
GetTok1: 7417 ms
GetTok2: 7173 ms
GetTok3: 34644 ms
GetTok4: 1480 ms
GetTokBK: 653 ms

Depending on the characteristics of your data, whether the delimiter is likely to be a character or not, and how you work with it, different approaches may be faster.

(I made a mistake in my earlier program, I wasn't measuring the same operations for each style of routine. I updated the pastebin link and benchmark results.)

Barry Kelly 2009-11-08 03:54:15

Barry: Thanks for your response. See my "followup" in my question.

lkessler 2009-11-08 04:48:03

Nice... +1! Reason why GetTok3 is so slow we can find in fact that you have enabled switch in compiler options 'String Format Checking'. Switch off this switch and repeat the measurement!

GJ 2009-11-08 17:47:44

Barry: My final "best" code that loops by PChar instead of by Token is very similar to your up front tokenization. That may be optimal for this type of problem, and indicates a nice general methodology of sequential processing through strings for fast execution.

lkessler 2009-11-08 20:44:18

Answer 6

+4 A:

Your new function (the one with PChar) should declare "Delim" as Char and not as String. In your current implementation the compiler has to convert the PLine^ char into a string to compare it with "Delim". And that happens in a tight loop resulting is an enormous performance hit.

function GetTok(const Line: string; const Delim: Char{<<==}; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
{ LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; http://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
 I: integer;
 PLine, PStart: PChar;
begin
  PLine := PChar(Line);
  PStart := PLine;
  inc(PLine);
  for I := 1 to TokenNum do begin
    while (PLine^ <> #0) and (PLine^ <> Delim) do
      inc(PLine);
    if I = TokenNum then begin
      SetString(Result, PStart, PLine - PStart);
      break;
    end;
    if PLine^ = #0 then begin
      Result := '';
      break;
    end;
    inc(PLine);
    PStart := PLine;
  end;
end; { GetTok }

Andreas Hausladen 2009-11-08 09:43:14

You got it Andreas! Puzzle solved, and a warning to those passing strings when Chars will do.

lkessler 2009-11-08 17:52:06

Answer 7

+1 A:

I'm not the person always blaming the algorithm, but if I look at the first piece of source, the problem is that for string N, you do the POS/posexes for string 1..n-1 again too.

This means for N items, you do sum (n, n-1,n-2...1) POSes (=+/- 0.5*N^2) , while only N are needed.

If you simply cache the position of the last found result, e.g. in a record that is passed by VAR parameter, you can gain a lot.

type
TLastPosition = record elementnr : integer; // last tokennumber elementpos: integer; // character index of last match end;

and then something

if tokennum=(lastposition.elementnr+1) then begin newpos:=posex(delim,line,lastposition.elementpos); end;

Unfortunately, I don't have the time now to write it out, but I hope you get the idea

Marco van de Voort 2009-11-09 13:00:00

Well, the rewritten algorithm gets rid of the Pos and PosEXs completely. But your idea is good in terms of optimizing the original.

lkessler 2009-11-09 16:43:19

@lkessler: The point does hold for the rewritten algorithm as well, that's what I meant in my answer. If you get the first 5 tokens from the same string one after the other, then you will scan 5 times for the first, 4 times for the second, ... A different procedure which returns all 5 tokens in an array should be faster, if you take care how you return the results (no reallocation of the array). That's independent of whether you use `PosEx()`. For the rewritten algorithm you could return the token address and use it as the search start for the next function call.

mghie 2009-11-09 18:57:02

mghie: Yes. Good point. Best might be an implementation of GetFirstTok and GetNextTok for the cases where I need to get them sequentially.

lkessler 2009-12-21 04:44:44

ansaurus

tags:

views:

answers:

Is There A Fast GetToken Routine For Delphi?

related questions