I need to load a large CSV file (>1 MB) and parse it. Normally this is easy: split first on linebreaks and then on commas. The problem is that some entries contain strings that include their own commas. When the spreadsheet is converted to CSV, the entries containing commas are wrapped in quotes.

I've written a parser that first escapes all the commas inside these quoted strings, then splits the text on linebreaks and then on commas, and finally unescapes the values again.

This is quite a slow process for such a long string, as I need to iterate through the whole string. Does anyone know a faster or more optimised method of dealing with this?

+1  A: 

Processing the file in a single pass will reduce the time. This can be achieved by using a simple state machine to handle the complexity of commas embedded in the values. Regards

Howard May
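For illustration, the single-pass state machine could look roughly like the sketch below. It is written in TypeScript rather than ActionScript 3 (the syntax is close), and `parseCsv` is a made-up name, not part of any library. It assumes fields are quoted with double quotes and that a quote inside a quoted field is written as two quotes.

```typescript
// Single-pass CSV parser: one loop over the characters, no pre-escaping.
// Assumption: double-quoted fields, with "" as an escaped quote inside them.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false; // the only state the machine needs

  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"'; // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and linebreaks are plain data here
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch; // drop the \r of Windows-style \r\n line endings
    }
  }
  // Flush the last row when the text does not end with a linebreak.
  if (field.length > 0 || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

Each character is visited once, so the escape, split and unescape passes collapse into one. Appending one character at a time may itself be slow in some runtimes; recording start and end indices and slicing the field out could be faster, but that would need measuring.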
Hi Howard, thanks for your reply. Unfortunately, one of the limitations of Flash and ActionScript is that an error is thrown if a script executes for more than 15 seconds. This limit can be changed, but the SWF is completely unresponsive while the script runs, so that isn't ideal. At the moment I'm breaking the processing down using Robert Penner's Chunker class, which at least lets a loading animation play, but I think there must be a faster way to do this. I'm not sure what you mean by using a simple state machine to handle this.
danjp
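One way to keep a single-pass parser within the script time limit is to make it resumable, so it can be run a slice at a time. The sketch below is hypothetical TypeScript: `ParserState`, `createState` and `parseChunk` are invented names and say nothing about the actual API of the Chunker class. It assumes the same quoting rules as above (double quotes, with `""` as an escaped quote).

```typescript
// Hypothetical resumable parser; all names here are illustrative only.
interface ParserState {
  text: string;
  pos: number; // index of the next character to read
  inQuotes: boolean;
  field: string;
  row: string[];
  rows: string[][];
}

function createState(text: string): ParserState {
  return { text: text, pos: 0, inQuotes: false, field: "", row: [], rows: [] };
}

// Consumes roughly maxChars characters and returns true once all text is parsed.
function parseChunk(s: ParserState, maxChars: number): boolean {
  const end = Math.min(s.text.length, s.pos + maxChars);
  while (s.pos < end) {
    const ch = s.text.charAt(s.pos++);
    if (s.inQuotes) {
      if (ch !== '"') {
        s.field += ch;
      } else if (s.text.charAt(s.pos) === '"') {
        s.field += '"'; // escaped quote; may read one char past the chunk end
        s.pos++;
      } else {
        s.inQuotes = false;
      }
    } else if (ch === '"') {
      s.inQuotes = true;
    } else if (ch === ",") {
      s.row.push(s.field);
      s.field = "";
    } else if (ch === "\n") {
      s.row.push(s.field);
      s.rows.push(s.row);
      s.row = [];
      s.field = "";
    } else if (ch !== "\r") {
      s.field += ch;
    }
  }
  if (s.pos < s.text.length) {
    return false; // more work left; call again later
  }
  // End of input: flush a final row that has no trailing linebreak.
  if (s.field.length > 0 || s.row.length > 0) {
    s.row.push(s.field);
    s.rows.push(s.row);
    s.field = "";
    s.row = [];
  }
  return true;
}
```

Calling something like `parseChunk(state, 50000)` once per frame until it returns true would let the loading animation keep playing between slices. The chunk size of 50000 is an arbitrary guess and would need tuning.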
+1  A: 

Have you had a look at csvlib yet? It is a parser library for ActionScript 3. It claims to be designed to properly handle quoted strings.

Hopefully, you are already enclosing your strings in quotes, especially the ones containing commas. CSV parsers cannot distinguish a comma that is part of a string from a comma that separates two strings, unless the strings have quotes around them.

Good:
    "This string, has a comma", "This string doesn't"

Bad:
    This string, has a comma, this string doesn't
Robert Harvey
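To make the difference concrete, here is a small TypeScript sketch (the `naiveSplit` helper is hypothetical) of what a plain split on commas does with the two lines above:

```typescript
// Naive approach: split on every comma, ignoring quotes entirely.
function naiveSplit(line: string): string[] {
  return line.split(",");
}

const quotedLine = '"This string, has a comma", "This string doesn\'t"';
const unquotedLine = "This string, has a comma, this string doesn't";

console.log(naiveSplit(quotedLine).length);   // 3
console.log(naiveSplit(unquotedLine).length); // 3
```

Both lines come back as three pieces although two values were intended. In the quoted line the quotes still mark where each value starts and ends, so a quote-aware parser can recover the two values; in the unquoted line that information is gone.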
Hi Robert, any entries in the CSV that have their own commas are wrapped in quotes. I'll have a look at csvlib - thanks for the link.
danjp
FYI unless ActionScript is really slow, it shouldn't take longer than about a second to parse a 1 MB CSV file.
Robert Harvey
The extra time is because I'm first changing every comma that is wrapped in quotes into a tilde (and removing the quotes), then splitting the string on linebreaks and commas, and then changing every tilde back to a comma. Is there a better way to get around this?
danjp
Some CSV parsers do this by setting a boolean flag (e.g. IgnoreCommas) when they encounter an opening quote and clearing it when they encounter the closing quote. Others use a regex. csvlib should already do whatever is necessary.
Robert Harvey
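As a rough idea of the regex variant, the TypeScript sketch below splits a single line into fields. `splitLine` is a hypothetical helper, not taken from csvlib. It assumes no quoted field contains a linebreak, that there is no whitespace between a comma and an opening quote, and that a quote inside a quoted field is doubled. The pattern has not been tried in ActionScript 3 itself.

```typescript
// Splits one CSV line into fields with a regular expression.
function splitLine(line: string): string[] {
  // Group 1: a quoted field (with "" as an escaped quote) or a bare field.
  // Group 2: the delimiter after it, either a comma or the end of the line.
  const fieldPattern = /("(?:[^"]|"")*"|[^,]*)(,|$)/g;
  const fields: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = fieldPattern.exec(line)) !== null) {
    let value: string = match[1] || "";
    const last = value.length - 1;
    if (last >= 1 && value.charAt(0) === '"' && value.charAt(last) === '"') {
      // Strip the surrounding quotes and collapse doubled quotes.
      value = value.slice(1, -1).replace(/""/g, '"');
    }
    fields.push(value);
    if (match[2] === "") {
      break; // matched the end of the line, so this was the last field
    }
  }
  return fields;
}
```

For example, `splitLine('a,"b, c",d')` should give the three fields `a`, `b, c` and `d`. Malformed input such as an unclosed quote is passed through unchanged rather than reported, so a real parser would want stricter handling.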