views:

343

answers:

4

I wrote some utility methods that can parse a 32-bit signed integer much faster than Int32.Parse. I hope you could give me some of your experience by reviewing my code and suggesting enhancements or pointing out bugs. If you are interested (and have the time of course), I would be very grateful.

I have posted my code on the "Refactor My Code" website. Here is the link: http://refactormycode.com/codes/699-int32-quick-parser

Thank you.


Edit: Thanks everybody for your helpful suggestions. I have followed your ideas, and here is the updated version: http://refactormycode.com/codes/699-int32-quick-parser#refactor_142758

(Please use a diff utility to see the changed code lines.)

Thank you!

+2  A: 

One though; consider renaming the class; calling it Int32 is just asking for trouble, although you get a bit of scope here because it is nested (Numbers.Int32). I wouldn't get too excited about this, though (the nesting does save a lot of pain for the consumer).

Don't call arguments thing like str - that isn't meaningful. That said, int.Parse calls it s, so I can't be too critical.

But I've got to grant you, a 10x performance figure isn't bad ;-p (although it only supports one format, of course).

One other thing; a start and length is a more common pattern than a start and end.

Finally - is there an off-by-one in the ParseSimple checks?

if (end > str.Length) throw new ArgumentOutOfRangeException("end");

Shouldn't that be >=? If length is 1, end can only be 0?

Marc Gravell
Thanks. I was thinking the name "Numbers.Int32" was the most intuitive, in the sense that methods would be called as "Numbers.Int32.ParseSimple(str)". And if someone wants to import it directly (instead of using the "Number." prefix) they could say "using IntUtil = Numbers.Int32". What do you think?
Hosam Aly
Perhaps don't worry about the Int32 thing; I've updated - the nesting seems to make it fairly safe. Of course, if Int32 was a *property* that returned a singleton instance of some type called something safer , then it might be clearer (but would use virtcall).
Marc Gravell
I think I first (and last) saw the property style in NUnit, but I didn't think much about it. But I don't feel it would be useful for this class in any way, would it? (And the virtcall would decrease performance a little, wouldn't it?)
Hosam Aly
Start and end was all over the STL, and that's where I learned it from. It seemed quicker at first, but now when I look it over I think length would be more usable. I'll probably change it.
Hosam Aly
I was going to comment on the assertions not checking input in release, but on reflection that *might* be OK for "Trusted" mode - but it will blow up with very unusual exceptions when the input is bad. A tricky one to call... if it isn't *significantly* quicker, I might simply cut this method.
Marc Gravell
The if condition is not an off-by-one. As said in the method documentation, end should be *after* the last valid index. (Same as in C++, where the end iterator points after the last item.) So end can be equal to length (and that's how other overloads call the method).
Hosam Aly
... Anyway, I intend to follow your advice and replace "end" with "length", so this won't matter.
Hosam Aly
Regarding debugging assertions for "Trusted" methods, I explicitly wrote in the documentation that unless the input is *known* to be valid, it should not be used. The assumption is that you "Trust" your input (which probably passed other validations first).
Hosam Aly
... From a performance point of view, ParseTrusted is almost twice as fast as ParseSimple, and that's why I believe it's valuable.
Hosam Aly
I have updated the implementation. Would you have time to look at it again? I updated the question to include the update link. Thank you!
Hosam Aly
May I ask a side question? Do you think the code would have been easier to review if I had published it without the documentation comments?
Hosam Aly
+1  A: 

So if you have parse trusted, then maybe you should consider implementing ParsePositive ParseNegative

And you should be watching the IL while editing your code

erdogany
Thanks for the suggestion. I'll probably implement ParsePositive and ParseNegative. I've also been watching the IL. Is there anything specific I should be looking at?
Hosam Aly
+2  A: 

One somewhat bizarre thought: rather than subtracting '0' on every iteration, you could have a precalculated amount to subtract at the very end, depending on the number of digits. (length=1, subtract '0'; length=2, subtract ('0'*10 + '0')). You may well only be able to get away with this for relatively short strings though, due to overflow. (Or use a long, but that may slow it down too much.)

Also, I haven't tried it, but you might check whether

result = (result << 3) + result + result + [...]

or

result = (result << 3) + (result << 1) + [...]

is faster than

result = result*10 + [...]

I suspect that it won't be, but it may be worth at least trying. (Certainly bitshifting is faster than multiplying, but I suspect the extra additions involved counterbalance this.)

Have you profiled the code to see where any bottlenecks are?

Do you have any benchmark results of how much faster this is than Int32.Parse, btw?

Jon Skeet
I shall try to check these. I haven't actually profiled this code.(Partly because I don't have a good profiler. I'd be thankful if you could refer me to a good free one.)
Hosam Aly
My current benchmarks show that for 25 million iterations over a constant string, my methods are 6 to 12 times faster than Int32.Parse. Marc also got similar results.
Hosam Aly
I have tried the shift operations, but they didn't provide a performance improvement. On larger strings they were slower, and on shorter strings they were faster, but in both cases the difference was negligible (1% or so).
Hosam Aly
The subtractions actually enhanced the performance, sometimes by 10%. :) And there is no fear for overflow, the int justifies itself correctly (as long as the bits are concerned). I shall add this to the code. Thanks!
Hosam Aly
I have updated the implementation. Would you have time to look at it again? I updated the question to include the update link. Thank you!
Hosam Aly
May I ask a side question? Do you think the code would have been easier to review if I had published it without the documentation comments?
Hosam Aly
(I couldn't think of anything else to say, btw, hence lack of later comment.) Yes, it would have been easier without doc comments.
Jon Skeet
+4  A: 

I think your implementation is pretty decent. I've made an unsafe version using pointer math, which is indeed faster, but is dirtier in my opinion.

Results (running 30 samples of 10M random int strings):

Int32.Parse                      1766.17 ms (stddev: 1.83 ms)
Numbers.Int32.ParseTrusted       262.07 ms (stddev: 0.44 ms)
ParseTrustedUnsafe               146.13 ms (stddev: 0.43 ms)

here is my unsafe implementation:

public static unsafe int ParseTrustedUnsafe(string str, int start, int end)
{
    unsafe
    {
     Int32 result = 0;
     Int32 length = end - start;
     Boolean isNegative = false;
     fixed (Char* startChar = str)
     {
      Byte* currentChar = ((Byte*)startChar) + (start * 2);
      if (*currentChar == 0x2D)
      {
       isNegative = true;
       currentChar += 2;
       length--;
      }
      else if (*currentChar == 0x2B)
      {
       currentChar += 2;
       length--;
      }
      while (length >= 4)
      {
       result = (result * 10) + (*currentChar - 0x30);
       result = (result * 10) + (*(currentChar + 2) - 0x30);
       result = (result * 10) + (*(currentChar + 4) - 0x30);
       result = (result * 10) + (*(currentChar + 6) - 0x30);

       currentChar += 8;
       length -= 4;
      }
      while (length > 0)
      {
       result = (result * 10) + (*currentChar - 0x30);

       currentChar += 2;
       length -= 1;
      }
     }
     return isNegative ? -result : result;
    }
}

Below is my validation and benchmark program:

static void Main(string[] args)
{
    String[] testValues = new String[10 * 100 * 100 * 100];
    Random rand = new Random();
    for (int i = 0; i < testValues.Length; i++)
    {
     testValues[i] = rand.Next().ToString();
    }
    // validate
    for (int i = 0; i < testValues.Length; i++)
    {
     Debug.Assert(String.CompareOrdinal(testValues[i], ParseTrustedUnsafe(testValues[i], 0, testValues[i].Length).ToString()) == 0, "ParseTrustedUnsafe failed on " + testValues[i]);
     Debug.Assert(String.CompareOrdinal(testValues[i], Numbers.Int32.ParseTrusted(testValues[i], 0, testValues[i].Length).ToString()) == 0, "Numbers.Int32.ParseTrusted failed on " + testValues[i]);
    }
#if DEBUG
    return;
#endif    
    List<List<Double>> results = new List<List<double>>();
    for (int i = 0; i < 3; i++)
    {
     results.Add(new List<double>());
    }

    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
    if (Environment.ProcessorCount > 1)
    {
     Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(1 << (Environment.ProcessorCount - 1)); // use last core only
    }
    Stopwatch sw = new Stopwatch();
    for (int testrun = 0; testrun < 30; testrun++)
    {
     sw.Reset();
     sw.Start();
     for (int i = 0; i < testValues.Length; i++)
     {
      Int32.Parse(testValues[i]);
     }
     sw.Stop();
     results[0].Add(sw.ElapsedMilliseconds);

     sw.Reset();
     sw.Start();
     for (int i = 0; i < testValues.Length; i++)
     {
      Numbers.Int32.ParseTrusted(testValues[i]);
     }
     sw.Stop();
     results[1].Add(sw.ElapsedMilliseconds);

     sw.Reset();
     sw.Start();
     for (int i = 0; i < testValues.Length; i++)
     {
      ParseTrustedUnsafe(testValues[i], 0, testValues[0].Length);
     }
     sw.Stop();
     results[2].Add(sw.ElapsedMilliseconds);

    }
    Console.WriteLine("Int32.Parse \t\t\t {0:0.00}ms (stddev: {1:0.00}ms)", results[0].Average(), StandardDeviation(results[0]));
    Console.WriteLine("Numbers.Int32.ParseTrusted \t {0:0.00}ms (stddev: {1:0.00}ms)", results[1].Average(), StandardDeviation(results[1]));
    Console.WriteLine("ParseTrustedUnsafe \t\t {0:0.00}ms (stddev: {1:0.00}ms)", results[2].Average(), StandardDeviation(results[2]));
}

private static double StandardDeviation(IEnumerable<double> doubleList)
{
    double average = doubleList.Average();
    double sumOfDerivation = 0;
    foreach (double value in doubleList)
    {
     sumOfDerivation += (value) * (value);
    }
    double sumOfDerivationAverage = sumOfDerivation / doubleList.Count();
    return Math.Sqrt(sumOfDerivationAverage - (average * average));
}
Davy Landman
Well, thanks for sharing! I had thought about creating an unsafe implementation, but on my machine I didn't observe a noticeable difference, but I didn't try loop unrolling. Maybe I should learn to do this. I never thought about using byte pointers instead of char. Would it really make a difference?
Hosam Aly
You may enhance performance further by removing the subtraction of 0x30 from the loop, and performing it once at the end (as per Jon Skeet's suggestion; see my updated implementation).
Hosam Aly
Hosam Aly
adding the reduction later does not save that much ParseTrustedUnsafe 442,13ms(stddev:3,50ms),ParseTrustedUnsafe2 428,70ms(stddev: 3,61ms)...In my opinion, trusted really trusts the caller.. so allows the caller to shoot himself in the foot. And for other questions, make it a challenge for yourself?
Davy Landman