views:

296

answers:

3

It is common knowledge that SameStr(S1, S2) is faster than S1 = S2, where var S1, S2: string in Delphi.

(And, of course, SameText(S1, S2) is much faster than AnsiLowerCase(S1) = AnsiLowerCase(S2).)

But, as far as I understand it, SameStr(S1, S2) does exactly the same thing as S1 = S2, so I cannot help but wonder why in the world the Delphi compiler doesn't use the SameStr code when it test for string equality using the = operator. Surely there must be a reason for this?

Some Benchmarking

A trivial program,

program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  RejbrandCommon;

const
  N = 1000000;

var
  Strings1, Strings2: StringArray;
  i: integer;
  b: {dummy }boolean;

procedure CreateRandomStringArrays;
var
  i: integer;
begin
  SetLength(Strings1, N);
  SetLength(Strings2, N);
  for i := 0 to N - 1 do
  begin
    Strings1[i] := RandomString(0, 40);
    Strings2[i] := RandomString(0, 40);
  end;
end;

begin

  CreateRandomStringArrays;

  StartClock;
  for i := 0 to N - 1 do
    if Strings1[i] = Strings2[i] then
      b := not b;
  StopClock;
  OutputClock;

  StartClock;
  for i := 0 to N - 1 do
    if SameStr(Strings1[i], Strings2[i]) then
      b := not b;
  StopClock;
  OutputClock;

  Pause;

end.

where

function RandomString(const LowerLimit: integer = 2; const UpperLimit: integer = 20): string;
var
  N, i: integer;
begin
  N := RandomRange(LowerLimit, UpperLimit);
  SetLength(result, N);
  for i := 1 to N do
    result[i] := RandomChar;
end;

and the inlined

function RandomChar: char;
begin
  result := chr(RandomRange(ord('A'), ord('Z')));
end;

and the "clock" functions just wrap QueryPerformanceCounter, QueryPerformanceFrequency, and Writeln, produces the output

2.56599325762716E-0002
1.24310093156453E-0002
ratio ~ 2.06

If the difference in length of the two strings that we compare is large, then the difference is even bigger. We try

Strings1[i] := RandomString(0, 0); // = '';
Strings2[i] := RandomString(0, 40);

and obtain

1.81630411160156E-0002
4.44662043198641E-0003
ratio ~ 4.08

So why doesn't the compiler use the SameStr code when writing assembly for S1 = S2?

Update

After reading Cosmin Prund's excellent answer, I couldn't resist setting

Strings1[i] := RandomString(40, 40);
Strings2[i] := RandomString(40, 40);

to produce strings of equal length and indeed.

2.74783364614126E-0002
1.96818773095322E-0002
ratio ~ 1.40

Hm... SameStr still wins...

My Specs

CPU Brand String: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
Memory: 6 GB
OS: Windows 7 Home Premium (64-bit)
Compiler/RTL: Delphi 2009

Update

It would seem (see the comments below Cosmin Prund's excellent answer) like the = operator was changed between D2009 and D2010. Can anyone confirm this?

+4  A: 

SameStr has an optional third parameter: LocaleOptions. You get the behaviour similar to "=" by leaving out the third parameter: a case senstive locale independent comparison.

You would think that is the same as a binary comparison, but it isn't.

Since D2009 Delphi strings have an "code page" payload in addition to the length and the refcount.

  StrRec = packed record
    codePage: Word;
    elemSize: Word;
    refCnt: Longint;
    length: Longint;
  end;

When you do a String1 = String2 you are telling the compiler to ignore all the information about the string and simply do a binary comparison (it uses UStrEqual for that).

When you do a SameStr or CompareStr (which is used by SameStr) Delphi will first check the string for being Unicode (UTF-16LE) and if not, convert them before doing the actual work.

You can see this when you look at the implementation of CompareStr (the one without the third parameter) which, after initial optimisations, checks whether the arguments are unicode strings and if not, converts them using UStrFromLStr.

Update:

Actually, UStrEqual (by means of UStrCmp) also does conversions, like CompareStr it looks at the elemSize of the strings to decide whether they are Unicode or not and converts them if they aren't.

So the reason why the compiler does not use SameStr (CompareStr) for the = operator eludes me at the moment. The only thing I can think of is that it has a nice analogy to the LStrEqual used to '='-compare AnsiStrings. I guess only the compiler people know.

Sorry to have wasted your time. I am leaving the answer though, so others won't have to go down this route of investigation.

Marjan Venema
Hm... But shouldn't this imply that = is faster than SameStr?
Andreas Rejbrand
@Anders. No, see my update. '=' does do conversions. It just uses a different compare loop which I guess is a lot less performant than the one in CompareStr.
Marjan Venema
Not really a big thing, but my name is actually "Andreas".
Andreas Rejbrand
@Andreas: Ooooops... Sorry. Just pretend I mistook you for Anders Helsjberg ...
Marjan Venema
+16  A: 

Answer

It all depends on how you're building the random strings. I used a modified version of the code, because very few of us have the RejbrandCommon unit, and because I wanted to use Excel to finish my analyses (and make pretty pictures).

Code (skip over the code to see some conclusions):

program Project3;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

const
  StringsNumber = 2000000;

var
  Strings1, Strings2: array of string;
  StrLen: integer;
  b: {dummy }boolean;

function RandomString(MinLen, MaxLen:Integer):string;
var N, i:Integer;
begin
  N := MinLen + Random(MaxLen-MinLen);
  Assert(N >= MinLen); Assert(N <= MaxLen);
  SetLength(Result, N);
  for i:=1 to N do
    Result[i] := Char(32 + Random(1024)); // Random Unicode Char
end;

procedure CreateRandomStringArrays(StrLen:Integer);
var
  i: integer;
  StrLen2:Integer;
begin
  SetLength(Strings1, StringsNumber);
  SetLength(Strings2, StringsNumber);
  for i := 0 to StringsNumber - 1 do
  begin
    StrLen2 := StrLen + Random(StrLen div 2);
    Strings1[i] := RandomString(StrLen, StrLen2);
    StrLen2 := StrLen + Random(StrLen div 2);
    Strings2[i] := RandomString(StrLen, StrLen2);
  end;
end;

var C1, C2, C3, C4:Int64;

procedure RunTest(StrLen:Integer);
var i:Integer;
begin
  CreateRandomStringArrays(StrLen);

  // Test 1: using equality operator
  QueryPerformanceCounter(C1);
  for i := 0 to StringsNumber - 1 do
    if Strings1[i] = Strings2[i] then
      b := not b;
  QueryPerformanceCounter(C2);

  // Test 2: using SameStr
  QueryPerformanceCounter(C3);
  for i := 0 to StringsNumber - 1 do
    if SameStr(Strings1[i], Strings2[i]) then
      b := not b;
  QueryPerformanceCounter(C4);

  // Results:
  C2 := C2 - C1;
  C4 := C4 - C3;
  WriteLn(IntToStr(StrLen) + #9 + IntToStr(C2) + #9 + IntToStr(C4));
end;

begin

  WriteLn('Count'#9'='#9'SameStr');
  for StrLen := 1 to 50 do
    RunTest(StrLen);

end.

I made the CreateRandomStringArrays routine take an StrLen parameter so I can run multiple similar tests in a loop. I made the code use QueryPerformanceCounter directly and WriteLn the results in tab-delimited format so I can copy/paste it into Excel. In Excel I get results in this form:

StrLen  =   SameStr
1   61527   69364
2   60188   69450
3   72130   68891
4   78847   85779
5   77852   78286
6   83612   88670
7   93936   96773

I then normalized things a bit. On each line made the maximum value "1" and the other value an percentage of 1. The result looks like this:

StrLen  =   SameStr
1   0,88    1
2   0,86    1
3   1   0,95
4   0,91    1
5   0,99    1
6   0,94    1
7   0,97    1

And then I started playing with the CreateRandomStringArrays routine to run multiple tests.

This is how the plot looks like for the original case (CreateRandomStringArrays generates strings of random length, of length 1 to whatever's on the X axis). Blue is the result for the "=" operator, red is the result for the "SameStr", lower is better. It clear shows SameStr() has an edge for strings longer then 10 chars.

alt text

Next test, made CreateRandomStringArrays return strings of EQUAL length. The content of the strings is still fully-random, but the length of the strings is equal to whatever is on the X axis. This time the "=" operator is clearly more efficient:

alt text

Now the real question is, with REAL code, what's the probability of strings being equal? And how large has to be the difference for SameStr() to start gaining terrain? Next text, I'm building two strings, the first one's of StrLen (the number on the X axis), the second string has a length of StrLen + Random(4). Again, the "=" operator is better:

alt text

Next test, I have two strings, each of length: StrLen + Random(StrLen div 10). The "=" operator is better.

alt text

... and my final test, strings of +/- 50% length. Formula: StrLen + Random(StrLen div 2). The SameStr() wins this round:

alt text

Conclusion

I'm not sure. I didn't expect this to be linked to string length! I'd expect both functions to handle strings of different lengths lightning fast, but it doesn't happen.

Cosmin Prund
+1 (+10 if I could). Very impressively thorough. And very interesting results. I sure am gonna take a second look at this and see what implications it has for the string-intensive work in the apps I work on.
Marjan Venema
+1 Very good analysis. I've always known that SameStr is particulary good at strings of different length, but it came as a surprise to me that = actually is faster for strings of the same length. You learn something new every day!
Andreas Rejbrand
Hm... I am not able to reproduce the result that "=" wins for equal-length strings...
Andreas Rejbrand
@Andreas, here's a downloadable version of my code, should compile unchanged on your computer, set up for the "strings of equal length" test. Download it, try it, let me know: http://fisiere.sediu.ro/PentruForumuri/StrDemo.dpr
Cosmin Prund
My result: http://privat.rejbrand.se/streqlen.txt. `SameStr` is much faster.
Andreas Rejbrand
This is getting interesting. Here are some results of mine, from multiple machines: http://fisiere.sediu.ro/PentruForumuri/dual_quad_xeon.txt and http://fisiere.sediu.ro/PentruForumuri/intel_core_2_duo.txt ; Tomorrow I'll test on an Intel i7, probably similar to yours. My guess: It's compiler related, not CPU related. I'm on Delphi 2010. If you don't mind, test with my compiled code: http://fisiere.sediu.ro/PentruForumuri/Project3.zip ; Just run the EXE inside it, it will do the test, dumping the results to console and to a "MyResult.txt" file next to the exe.
Cosmin Prund
@Cosmin Prund: You are right! See my results made from your *.exe: http://privat.rejbrand.se/streqlen2.txt Now the = operator outperforms the `SameStr` function. Clearly something happened from D2009 to D2010. The `SameStr` has more or less the same performance, but the `=` operator seems much faster.
Andreas Rejbrand
@Andreas: it would be interesting to see the asm, then we know the difference! (btw both are running with same build options eg optimization?)
Remko
@Remko: I have tried to build the project both in debug and release mode. I have used D2009 to build Cosmin Prund's DRP: http://privat.rejbrand.se/streqlen.exe. Feel free to compare it with Cosmin Prund's EXE.
Andreas Rejbrand
+1  A: 

On my system, "=" is faster than SameStr.

SameStr does get faster(about 20%) with the "RandomString(0,0)" exemple. but then again, if it is the 2nd string that is set to '', the performances are nearly the same. After some more testing, it seems it's the not difference of length that makes the difference of performance, it's the empty string that does.

Cosmin Prund just posted a much more thorough analysis...

One thing that should be kept in mind is that, for functions that are that small ( 1 millions test in a few msecs), the actual processor running the code might make a big difference. The ASM code might be a little more friendly to the BPU of 1 processor than the other... Or some instruction might run more efficiently on different CPU. Data alignment might be affecting it. Cache miss. Those are just a few exemple of things at the hardware level that might affect the final performance.

For information, the tests I did was on a Phenom X4 CPU.

Ken Bourassa