views:

365

answers:

11

I am writing a high performance parser, and it seems to me that Int32.Parse can be too slow. I wrote a simple version that assumes correct input, and it's performing much better. So should I create my own version instead? Or is there another faster method already available?

My method is like this:

// parse simple int, assuming relatively correct input (i.e. all digits)
public static int ParseInt32Simply(string str) {
    if (str == null) throw new ArgumentNullException("str");
    if (str.Length == 0) throw new ArgumentException("str is empty");

    int sign = 1, index = 0;
    if (str[0] == '-') { sign = -1; index = 1; }
    else if (str[0] == '+') { index = 1; }

    int result = 0;
    for (; index < str.Length; ++index) {
        result = 10 * result + (str[index] - '0');
    }

    if (result < 0) throw new OverflowException(str + " is too large for Int32");

    return result * sign;
}

My results are very different from the builtin equivalent:

Int32.Parse      took 8.2775453 seconds
ParseInt32Simply took 0.6511523 seconds
Int32.Parse      took 6.7625807 seconds
ParseInt32Simply took 0.4677390 seconds

(Running 25 million iterations on my machine; a P4 3 GHz, running VS 2008 SP1)

So, should I use my version? Or is there another method available that I can use?

+6  A: 

Have you profiled your code yet to determine that ParseInt32 is actually the bottleneck? I wouldn't replace something that's part of the "standard library" of the environment you're coding in unless you know for certain that you're going to see a benefit.

Zach Hirsch
I ran the two methods against each other, and I got the timing results mentioned above.
Hosam Aly
I meant profile application you're writing, not just this particular function. In that way, you can find out if the bottleneck in your program is Int32.Parse or some other place. You might be spending a lot of time doing some other processing.See http://en.wikipedia.org/wiki/Performance_analysis
Zach Hirsch
Well, I think the application cannot be more optimized now except for very few parts. But Int32.Parse certainly seems like it's taking a good percentage (although I haven't used a profiler to test my hypothesis).
Hosam Aly
Hosam, the only way to answer to your question is to use a profiler to prove that this is going to benefit your application in a measurable way. Are you going to be parsing 25 million integers?
Simon Steele
I am parsing 1.5 - 2 million messages, containing many integral numbers (dates, times, integers, ...), so I think my parser will benefit tremendously from the added performance. However, I was hoping that I would find a library that I could use, instead of creating my own! :(
Hosam Aly
+3  A: 

I think the main problem here is your sentence assumes correct input. From reading your code, it doesn't appear to handle "12x" properly.

There's lots of things that Int32.Parse does to validate the input, and might even take note of your culture to handle some culture-differences, although I can't think of any specifically for Int32.

You're sure that the bottleneck is Int32 in your code?

Lasse V. Karlsen
To enhance performance, this assumption has been made. I am receiving input from a trusted source, so I'm simplifying some conditions (hence the name "Simply"). It's part of the method documentation that cases such as "12x" will not be handled properly.
Hosam Aly
I know that Int32.Parse does a lot, and that's why I'm thinking of rolling my own. (Actually I believe it does much more than it should do in the most common case. You may check the internal System.Number.ParseNumber method.)
Hosam Aly
As you can see in the timing results above, there is a vast difference between the timings of my method and Int32.Parse. For my parser this may be a 5x speed increase!
Hosam Aly
A: 

The validation for null and empty string is not enough, you should check if parameter is valid integer.

Rumen Georgiev
I am simplifying the method by assuming that the input is correct. If the caller is not so sure then he should call Int32.Parse. (I'm not suggesting to replace Int32.Parse; I'm just making it simpler.)
Hosam Aly
+4  A: 

My view would be that if the time savings you are getting are significant and of benefit for your application, then go for it.

We had a vaguely similar issue with XML parsing and opted to do it manually for performance reasons, but it was based on a known environment - we were feeding the XML, so we could fairly safely take shortcuts in the parsing.

Obviously the risk is that it is not likely to be complete as the standard library version and so new developers to the team will need to be made aware of this, lest they do something to break it.

Chris Kimpton
A: 

how does your test look like? It seems your test is not ok.

I only have a small difference when I loop 50000 times and then I have a difference of about 30K ticks in favour of your custom method, but that is neglectable for the advantages of the normal method

Mafti
I ran my test 25 million times. I've updated my question to indicate that. Sorry for forgetting to mention it.
Hosam Aly
+5  A: 

In .net Int32.Parse is very very quick, when it's successful.

When it fails it throws an exception - then it's very slow because exceptions are slow.

You need to expand your test - you need to check times for a pattern of good and bad strings that's similar to what you need it to do.

If you are pretty sure that all your strings are valid ints then Int32.Parse is the way to go. If you suspect that any more than a negligible few are going to be valid then it is much quicker to use Int32.TryParse, rather than a try-catch in your loop.

Typically if your try-catch is outside the loop use Int32.Parse - you'll get an exception and stop the first time you get an invalid value.

If your try-catch is inside the loop use Int32.TryParse instead.

Both Int32.Parse and Int32.TryParse are pretty highly optimised and are relatively mature - I'd expect them to be very tough to improve on unless you have specialist circumstances.

Keith
I'm running my test using only successful cases, which should be the quickes way for Int32.Parse. But still, it's too slow compared to my method.
Hosam Aly
Int32.Parse also handles localisation - US users can parse "10,000" and French users can parse "10 000". If you don't need any of that then go for it.
Keith
+1  A: 

How do you measure the speed? I tried this:

Stopwatch sw = new Stopwatch();
Random rand = new Random();

for (int n = 0; n < 10; n++)
{
    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        ParseInt32Simply(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - ParseInt32Simply");
    sw.Reset();

    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        int.Parse(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - int.Parse");
    sw.Reset();
    Console.WriteLine();
}

and the results are quite different:

2932852 - ParseInt32Simply
4684522 - int.Parse

3003988 - ParseInt32Simply
4666928 - int.Parse

2892545 - ParseInt32Simply
4660209 - int.Parse

2888998 - ParseInt32Simply
4636007 - int.Parse

2955727 - ParseInt32Simply
4668501 - int.Parse

2929210 - ParseInt32Simply
4653799 - int.Parse

2893706 - ParseInt32Simply
4671503 - int.Parse

2899547 - ParseInt32Simply
4633957 - int.Parse

Your simple method is still faster, but less than 2x (that is very good performance actually!).

Rumen Georgiev
Try and make another loop that just does rand.Next().ToString(). You know, that might add 2,500,000 to that score :) In that case, ParseInt32Simply is 2,000,000 times faster :P
Statement
I'm running a similar test but using a static string (instead of rand, so as to get more accurate timings). But I'm getting the results I wrote in the question.
Hosam Aly
+4  A: 

Yes - you can use your own version of parsing int as long that you are 100% certain that the source data is something you have control over (and thus always conforms to your format of Int32). Also, you should use your own code isolated from the rest of the world because if you've got this in some library you're publishing, people might want to have the standard behaviour of Int32.Parse. If you can't provide it, it's no good for them. However, as many people here suggests, you should really be certain that this is what really needs doing if you're trying to squeeze out the most of your performance. However, you probably know your own code better than anyone here.

Personally I would try and avoid changing parsing. IF there are other bottlenecks then those might be worth investigating first.

Statement
The problem with Int32.Parse is that it uses a general method for parsing (internal System.Number.ParseNumber), which can parse any number form. This is certainly much slower than just parsing an int. Even if I expand my method to handle more cases, it consumes just double its original time.
Hosam Aly
+4  A: 

If your tests are verifiable and you truly need the performance gains (e.g. you call the function like tens of thousands of times a second) than go for it.

I would just change the name... because ParseInt32Simply does not tell the maintenance programmer anything. I think a name like TrustedSourceInt32Parse or GuaranteedInt32Parse or something along those lines is a better name.

Giovanni Galbo
Thanks for the suggestion. I couldn't think of a better name for my method. TrustedSourceInt32Parse looks good, but it's too long! Any more suggestions are welcome. :)
Hosam Aly
actually, I guess if you extend Int32 using extension methods, you can simply use "TrustedSourceParse".. e.g. int.TrustedSourceParse("1000");
Giovanni Galbo
Well, that's not possible, since this would be a static method, not an extension. :)
Hosam Aly
oops, good point; I didn't think that through.
Giovanni Galbo
+1  A: 

If your parsing a format of which you know to be valid numbers, you can indeed write a faster custom parser. I've written a Double.Parse function for the same purpose once. And it faster to begin with the least significant digit. That way you can just increment the power of the digit your parsing.

I've created a quick implementation of this,

public static Int32 ParseValidNumberAsInt32(string str)
{
    if (str == null) 
     throw new ArgumentNullException("str");
    if (str.Length == 0) 
     throw new ArgumentException("str is empty");
    Int32 result = 0;
    Int32 currentPower = 1;
    Boolean isNegative = str[0] == '-';

    for (int currentCharIndex = str.Length - 1; currentCharIndex > 0; currentCharIndex--)
    {
     result += (str[currentCharIndex] - '0') * currentPower;
     currentPower *= 10;
    }
    return isNegative ? -1 * result : result + ((str[0] - '0') * currentPower);
}

If you really want speed, you can write a unsafe implementation..

If your parsing a big file, you could read the files as raw bytes and work with those. That will make it a lot faster (no converting to unicode string, no splitting the strings in lines, no splitting the lines in substrings, no parsing the substrings), but you're going to lose maintainability.

Davy Landman
Thanks a lot for sharing your implementation. I'm not sure how beginning with the least significant digit is faster than the other way around. But I'll check it out.
Hosam Aly
As for the unsafe suggestion, it doesn't make a difference. Actually in some cases it was slower than the normal one, probably because it requires an additional instruction to "pin" the string (using "fixed").
Hosam Aly
I am actually reading from a StreamReader (wrapped over a socket), but since I'm reading ASCII then your point about working with raw bytes can probably make it faster. I'll try it and see what I get. Thanks a lot!
Hosam Aly
I've thought about why I began with the least significant digit, and I think it was because I was working with floats and wanted to keep the floating point math to a minimum (so keep track of the power instead of just multiplying the previous number). My estimate is the str[i] - '0' will be heavier.
Davy Landman
Indeed, working with raw bytes will make it faster, you can just use result += (data[i] - 0x30) * currentPower; and I suspect the IL is much cleaner.. Using a decent profiler you could pinpoint the line of code which is the biggest bottleneck and try to optimize that even more..
Davy Landman
+1  A: 

Take a look at this blog entry: Fast string to integer conversion by Karl Seguin.

dalle