views: 1672

answers: 10

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);

where line is a string that holds the values separated by the delimiter.

Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster way to do this.

Thanks!

+8  A: 

It should be pointed out that split() is a questionable approach for parsing CSV files, because you may come across commas inside quoted fields, e.g.:

1,"Something, with a comma",2,3

The other thing I'll point out, without knowing how you profiled, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and you may have significant overhead from the looping alone, so use some sort of control value.

That being said, split() (in Java, at least) is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.

So if you want to speed it up (and I have trouble believing that the performance of this part is really an issue), you want to do it by hand, and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do cleaning them up.

The algorithm for that is relatively simple (see the sketch after this list):

  • Stop at every comma;
  • When you hit quotes, continue until you hit the next set of quotes;
  • Handle escaped quotes (i.e. \") and arguably escaped commas (\,).
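
As a rough sketch of that algorithm (my own illustration, not the poster's code; it assumes the backslash escaping described above rather than the doubled-quote convention Excel and RFC 4180 use, and it needs System.Collections.Generic and System.Text):

public static List<string> SplitCsvLine( string line )
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for ( int i = 0; i < line.Length; i++ )
    {
        char ch = line[i];

        if ( ch == '\\' && i + 1 < line.Length ) // Escaped character: take the next char literally
        {
            current.Append( line[i + 1] );
            i++;
        }
        else if ( ch == '"' ) // Entering or leaving a quoted section
        {
            inQuotes = !inQuotes;
        }
        else if ( ch == ',' && !inQuotes ) // Unquoted comma ends the field
        {
            fields.Add( current.ToString() );
            current.Length = 0; // Reuse the buffer instead of allocating a new one
        }
        else
        {
            current.Append( ch );
        }
    }

    fields.Add( current.ToString() ); // Don't forget the last field
    return fields;
}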

Oh, and to give you some idea of the cost of regex: there was a question (Java, not C#, but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String; Jon Skeet manually coded the loop. Out of curiosity I compared the two versions, and his was an order of magnitude better.

So if you really want performance, it's time to hand parse.

Or, better yet, use someone else's optimized solution like this fast CSV reader.

By the way, while this is in relation to Java, it concerns the performance of regular expressions in general and replaceAll vs. a hand-coded loop: Putting char into a java string for each N characters.

cletus
I've linked an answer on a similar topic about string replace methods; you'll find the link at the end of my own answer to this question.
John Leidegren
+5  A: 

The BCL implementation of string.Split is actually quite fast. I've done some testing here trying to outperform it, and it's not easy.

But there's one thing you can do, and that's to implement this as a generator:

public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }

    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty? 
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }

    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}

The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.

The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat the string.Split performance, but it's going to take clever int-hacking. You can read my post about that here.
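
As a usage sketch (my addition, not part of the original answer), the GetSplit generator above would be consumed lazily like this:

// Hypothetical usage: values are produced one at a time as the line is scanned,
// so columns you never look at cost nothing.
foreach ( string value in "a,b,,c".GetSplit( ',' ) )
{
    Console.WriteLine( value ); // prints a, b, c (empty entries are skipped)
}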

John Leidegren
Apparently, there is no need to save memory, but there is a need to save CPU.
Dave Van den Eynde
@Dave Van den Eynde - I think it's important to do both! But yeah, memory optimizations are greatly overlooked by most programmers.
John Leidegren
I tried an approach similar to this, and it was slower than the existing algorithm that used Split, but because we were processing such large strings (multiple megabytes), it saved about 30% on RAM consumption.
torial
You know, that code isn't optimized, and the reason string.Split is faster is that it uses unsafe code. If you include that here, the running time is the same, except this is a lot more memory efficient.
John Leidegren
A: 

You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
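
A minimal sketch of what such a shim might look like (my illustration of the idea; StringShim and SplitIntoShims are made-up names, and it needs System.Collections.Generic):

public struct StringShim
{
    public readonly string Source;
    public readonly int Start;
    public readonly int Length;

    public StringShim( string source, int start, int length )
    {
        Source = source;
        Start = start;
        Length = length;
    }

    // Copying only happens if and when you actually need a real string.
    public override string ToString()
    {
        return Source.Substring( Start, Length );
    }
}

public static List<StringShim> SplitIntoShims( string line, char delimiter )
{
    var shims = new List<StringShim>();
    int start = 0;
    for ( int i = 0; i <= line.Length; i++ )
    {
        if ( i == line.Length || line[i] == delimiter )
        {
            shims.Add( new StringShim( line, start, i - start ) ); // A window into the original string, no copy
            start = i + 1;
        }
    }
    return shims;
}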

MSalters
It will cause string data copies once you need to pass a StringShim to something that accepts a string. Unless your whole app works with shims instead.
Dave Van den Eynde
You can't assume that at all. I'll dig up the example using regex vs hand-coding where the regex solution was an order of magnitude slower.
cletus
Here it is http://stackoverflow.com/questions/537174/putting-char-into-a-java-string-for-each-n-characters
cletus
My point is that it's hard to be faster *with the same interface*. My StringShim solution quite explicitly changes the split() interface to make things faster.
MSalters
A: 

You might think that there are optimizations to be had, but the reality is that you'll pay for them elsewhere.

You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.

One of the optimizations we could do in C or C++, for example, is to replace all the delimiters with '\0' characters and keep pointers to the start of each column. Then we wouldn't have to copy all of the string data just to get at part of it. But you can't do this in C#, nor would you want to.

If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.

I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.

Dave

Dave Van den Eynde
A: 

It makes sense that it takes longer to process the line than it does to read it. The question I must ask is: do you really need to care how long this method takes?

Standard rants about premature optimization apply; I would suggest looking at the overall performance of your app (compared to the business requirements) before diving into something so granular.

Causas
A: 

Some very thorough analysis of String.Split() vs. Regex and other methods.

We are talking millisecond savings over very large strings, though.

Charlie
Normally I like .Net Perls, but I think their comparison is unfair. If you know you are going to use a Regex a lot, you compile it and hoist it out of the loop. You'll get some big reductions in the overall time with that strategy.
torial
+1  A: 

I recently had a similar problem with splitting CSV files...

http://www.elamparuthi.com/2007/04/29/connect-to-a-csv-file-using-odbc-c/

I found this to be an interesting and extremely useful solution to the problem :)

Dave

Dave
A: 

The main problem(?) with String.Split is that it's general, in that it caters for many needs.

If you know more about your data than Split does, writing your own version can be an improvement.

For instance, if:

  1. You don't care about empty strings, so you don't need to handle those in any special way
  2. You don't need to trim strings, so you don't need to do anything with or around those
  3. You don't need to check for quoted commas or quotes
  4. You don't need to handle quotes at all

If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
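
As a hedged sketch of what such a specialized version might look like under those assumptions (no quoting, no trimming; SplitInto is a made-up helper that reuses a caller-supplied list, so no new string[] is allocated per line, and it needs System.Collections.Generic):

public static void SplitInto( string line, char delimiter, List<string> output )
{
    output.Clear(); // Reuse the same list for every line instead of allocating an array
    int start = 0;
    for ( int i = 0; i < line.Length; i++ )
    {
        if ( line[i] == delimiter )
        {
            output.Add( line.Substring( start, i - start ) );
            start = i + 1;
        }
    }
    output.Add( line.Substring( start ) ); // Trailing field
}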

Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.

The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.

However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.

Lasse V. Karlsen
A: 

CSV parsing is actually fiendishly complex to get right; I used classes based on wrapping the ODBC text driver the one and only time I had to do this.

The ODBC solution recommended above looks at first glance to be basically the same approach.

I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel convention of only double-quoting strings that need it is one of the trickiest things to deal with, in my experience.

kpollock
+1  A: 

Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do, since it sounds like you are parsing lines from a file), String.split(String regex) will call Pattern.compile on your regex string every time that statement executes. To optimize this, call Pattern.compile on the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
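
The advice above is written in terms of Java's Pattern class. A rough C# analogue, assuming you are splitting with a regular expression at all (plain String.Split does not use regexes), would be to hoist a compiled Regex out of the loop:

// Needs System.Text.RegularExpressions. Compile the pattern once, outside the loop,
// instead of re-parsing the pattern string on every line.
private static readonly Regex Delimiter = new Regex( ",", RegexOptions.Compiled );

// Inside the read loop:
// string[] values = Delimiter.Split( line );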

Hope this helps