views:

1075

answers:

9

I've seen second one in another's code and I suppose this length comparison have been done to increase code productivity. It was used in a parser for a script language with a specific dictionary: words are 4 to 24 letters long with the average of 7-8 lettets, alphabet includes 26 latin letters plus "@","$" and "_".

Length comparison were used to escape == operator working with STL strings, which obviously takes more time then simple integer comparison. But in the same time first letter distribution in the given dictionary is simply wider than a distribution of words size, so two first letters of comparing strings will be generally more often different, than the sizes of that strings. That makes length comparison unnecessary.

I've ran some tests and that is what I've found out: While testing two random strings comparison million times, second way is much faster, so length comparison seems to be helpful. But in a working project it works even slower in a debug mode and insufficiantly faster in a release mode.

So, my question is: why length comparison can fasten the comparison and why can it slow it down?

UPD: I don't like that second way either, but it had been done for a reason, I suppose, and I wonder, what is this reason.

UPD2: Seriously, the question is not how to do best. I'm not even using STL strings in this case anymore. There's no wonder that length comparison is unnecessary and wrong etc. The wonder is - it really tends to work slightly better in one certain test. How is this possible?

A: 

fire your implementation of STL. It should not matter

Greg Dean
+29  A: 

If it mattered, assume that your library did it already. Don't mess up your code this way for micro-optimisations unless it really matters.

1800 INFORMATION
+4  A: 

In your random test the strings might have been long enough to show the gain while in your real case you may deal with shorter strings and the constant factor of two comparison is not offset by any gain in not performing the string comparison part of the test.

Remo.D
+4  A: 

The implementation of the std::string operator== has no way of knowing whether it would be faster to check the length first or start checking characters. Clearly checking the length is a waste for strings of the same length. Therefore, different implementations of STL are likely to perform differently.

Only put the explicit length check in as a final optimisation (clearly commented as such), and only if your profiler confirms the benefit.

James Hopkin
+5  A: 

Generally, you should leave this to the STL and not worry about it.

However, if this IS an area you need to optimise (which I seriously doubt), AND if you understand the letter distribution/length distribution of your strings, you could derive a new class from string, and overload the == operator to perform the equality test in the most efficient way for your application. (Length first, first character first, forwards, backwards, whatever).

That would be better than having the 'optimization' scattered throughout your code.

Roddy
Actually I did. I've made static 8 DWORD reverse strings implementation for this specific parser and gained sufficient profit in productivity, but this is not the issue. I still have no answer for a question foregoing.
akalenuk
Don't derive from std::string, it doesn't have a virtual destructor.
rlbond
+1  A: 

length comparison doesn't make any sense to me .. using the comparison operator is enough

Zuhaib
It doesn't make any difference as to what operator== returns. However, if you have something like: std::string str1 = "a long string that ends with X"; std::string str2 = "a long string that ends with XX";Then checking the length will make it faster because you don't have to go through the whole string.
dan04
A: 

The length comparison is there to try some short-circuit optimization.

I'm assuming the length comparison is quicker than the full string compare, so if that can eliminate 99% of mismatches, it'll be quicker than doing the full string compare each time.

The code will execute the length comparison, it'll fail, then it will ignore the full string compare and skip the code.

databyss
+10  A: 

When can Short Circuiting be beneficial

Short circuit optimizations can be helpful only when:

  • the cost of comparison is low compared to the cost of the full test
  • the comparison often results in short circuiting

Mathematically, let S be cost of Short Circuiting condition, F cost of full condition, and P be percent of cases where Short Circuiting happens (full condition is not necessary).

The average cost of original case (no Short Circuiting) is F

The average cost of Short Circuiting optimization is S + F * (1-P)

Therefore, if the optimization is to have any benefit at all, following must apply:

S + F * (1-P) < F

i.e.

S < F*P

String comparison cost

Further you wrote:

which obviously takes more time then simple integer comparison.

This is not obvious at all. The string comparison terminates when first difference is found, therefore depending on what strings you process, it may terminate on first or second character in vast majority of cases. Moreover, the comparison may be optimized even for longer strings by first comparing DWORDS (4 characters at once) as long as there is enough data in both strings.

Your case

The major difference between random test data and scripting parsing is the real data are far from random. The parser is most likely deterministic, and once it matches, it does not compare any more. Even the script data are not random - some keywords are likely to be used a lot more than other ones. If the parser is constructed in such a way it checks most commonly used keyword first, a surprisingly high number of comparisons may need the full compare to be done, as full compare always needs to be done when string are matching.

Suma
A: 

The length of the std::string is quite likely a member of the std::string object. In comparison, the first character might very well be on the heap. That means that comparing the string length improves locality of reference. Of course, with the short-string optimization this becomes even more complex - Lhs[0] might be on the heap while Rhs[0] is on the stack.

MSalters