I am using C# to read a ~120 MB plain-text CSV file. Initially I parsed it by reading it line by line, but I recently found that reading the entire file into memory first is several times faster. The parsing is already quite slow because the CSV has commas embedded inside quoted fields, which means I have to split each line with a regex. This is the only pattern I have found that works reliably:
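// The pattern is split across two verbatim literals and concatenated,
// so that no line break ends up inside the regex itself.
string[] fields = Regex.Split(line,
    @",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)" +
    @"(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))");
// from http://regexlib.com/REDetails.aspx?regexp_id=621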
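To show what the split is meant to do, here is a small self-contained check. The sample line, the class name `SplitDemo`, and the helper `SplitLine` are made up for illustration and are not part of my real code. As far as I can tell, the quoted field should come back as a single field with its quotation marks still attached.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // The same pattern as above, written as two concatenated verbatim
    // literals so that no line break becomes part of the regex.
    const string Pattern =
        @",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)" +
        @"(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))";

    // Hypothetical helper: splits one CSV line on commas outside quotes.
    public static string[] SplitLine(string line)
    {
        return Regex.Split(line, Pattern);
    }

    static void Main()
    {
        // Made-up sample: the second field contains a comma inside quotes.
        string sample = "1,\"Smith, John\",42";

        // Print each field in brackets so the field boundaries are visible.
        foreach (string field in SplitLine(sample))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```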
To parse after reading the entire contents into memory, I split the string on the newline character to get an array containing each line. However, when I do this on the 120 MB file, I get a System.OutOfMemoryException. Why does it run out of memory so quickly when my computer has 4 GB of RAM? Is there a better way to quickly parse a complicated CSV?
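For reference, the loading step looks roughly like the sketch below. This is a simplified reconstruction, not my exact code: I am assuming `File.ReadAllText` for the read, and the class name `CsvLoadSketch`, the method `LoadLines`, and the command-line path are placeholders.

```csharp
using System;
using System.IO;

static class CsvLoadSketch
{
    // Reads the whole file into a single string, then splits it into lines.
    public static string[] LoadLines(string path)
    {
        // Assumption: the file is read in one call; the whole text is now
        // held in memory as one large string.
        string contents = File.ReadAllText(path);

        // Splitting creates a new string for every line, in addition to the
        // original string. This is the call where the
        // OutOfMemoryException shows up for the 120 MB file.
        return contents.Split('\n');
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: CsvLoadSketch <path-to-csv>");
            return;
        }

        string[] lines = LoadLines(args[0]);
        Console.WriteLine(lines.Length + " lines loaded");

        // Each line would then go through the Regex.Split call shown earlier.
    }
}
```

On small files this runs fine; the exception only appears once the input gets large.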