tags:

views:

1488

answers:

13
+3  Q: 

CSV Parsing

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.

Now this will not work if I have a value like:

"a",""b","x","y"","c"

then my output is:

'a'
'"b'
'x'
'y"'
'c'

but what I want is:

'a'
'"b","x","y"'
'c'

Is there any regex or any other logic I can use for this ?

A: 

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:

""

\"

In the second form your initial string would look like this:

"a","\"b\",\"x\",\"y\"","c"

If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.

dasil003
no, i don't think that's true. in his example, as long as you assume that the CSV is valid for as long as possible (and not just give up at the '"",b' part), then you could still parse that.
nickf
There is every chance that you can achieve the right result - its just more work. I have code that successfully does this (though not using regex).
Murph
A: 

Well, I'm no regex wiz, but I'm certain they have an answer for this.

Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.

Each time you run into a quote toggle dontMatch.

each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.

This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.

For instance,

"a", ""b", ""c", "d"", "e""

will yield bad results.

This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.

To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.

Adam Davis
+1  A: 

If all your values are guaranteed to be in quotes, look for values, not for commas:

("".*?""|"[^"]*")

This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.

If you don't want the enclosing quote to be part of the match, use:

"(".*?"|[^"]*)"

and go for the value in match group 1.

As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.

Tomalak
Thank you...for your informative reply ,this seems work for me.
+6  A: 

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.

*=remember that some locales use [tab] as the C in CSV...

Marc Gravell
My locale uses semicolons for the "C".. don't get me started with Excel and _comma_ separated files that aren't parsed correctly because the comma is an actual comma ;)
VVS
A: 

FileHelpers for .Net is your friend.

mlarsen
+1  A: 

There's an oft quoted saying:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.

Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:

    public IEnumerable<string> SplitCSV(string line)
    {
        int index = 0;
        int start = 0;
        bool inString = false;

        foreach (char c in line)
        {
            switch (c)
            {
                case '"':
                    inString = !inString;
                    break;

                case ',':
                    if (!inString)
                    {
                        yield return line.Substring(start, index - start);
                        start = index + 1;
                    }
                    break;
            }
            index++;
        }

        if (start < index)
            yield return line.Substring(start, index - start);
    }

Standard caveat - untested code, there may be off-by-one errors.

Limitations

  • The quotes around a value aren't removed automatically.
    To do this, add a check just before the yield return statement near the end.

  • Single quotes aren't supported in the same way as double quotes
    You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)

  • Whitespace isn't automatically removed
    Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.

Bevan
Don't forget that multi-line is also an option for quoted csv, and you'd need to test it with escaped quotes like "some""data" etc...
Marc Gravell
And then there are those people who roll their own string handling routine every time because they've heard that regular expressions just don't cut it. If you *know the data your are dealing with* regexes are fine. @xyz did not ask how to write a full-fledged ready-for-all-eventualities CSV parser.
Tomalak
What about 'inString' variable? Some part off logic is missing.
saku
@Tmoaluk: I agree that Regexes are a great tool, when used in the right place - but they are too frequently used in other places as well. When the problem is this simple, I feel that "Do the simplest thing that could possibly work" does not mean a Regex. YMMV.
Bevan
@Saku: You're right; I've corrected the code.
Bevan
Still doesn't work. This code leaves the quotes around double-quoted strings, which is unwanted behavior.
Bklyn
@Bklyn - You're correct, the quotes aren't removed by this. I don't want to make the code more complicated, but will make a note of this limitation.
Bevan
@Tomalak: I think this solution is probably more performant, just as easy to maintain and easy to extent (maybe simpler, I know a lot of accomplished developers who rather not dissect others regex's) and less prone for abuse (by nature it's lazy loaded through the Enumerator)
Cohen
A: 

See the link "Regex fun with CSV" at:

http://snippets.dzone.com/posts/show/4430

A: 

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Chris S
+1  A: 

I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.

Just for sake of exercising my mind, quick & dirty working C# procedure:

public static List<string> SplitCSV(string line)
{
    if (string.IsNullOrEmpty(line))
        throw new ArgumentException();

    List<string> result = new List<string>();

    bool inQuote = false;
    StringBuilder val = new StringBuilder();

    // parse line
    foreach (var t in line.Split(','))
    {
        int count = t.Count(c => c == '"');

        if (count > 2 && !inQuote)
        {
            inQuote = true;
            val.Append(t);
            val.Append(',');
            continue;
        }

        if (count > 2 && inQuote)
        {
            inQuote = false;
            val.Append(t);
            result.Add(val.ToString());
            continue;
        }

        if (count == 2 && !inQuote)
        {
            result.Add(t);
            continue;
        }

        if (count == 2 && inQuote)
        {
            val.Append(t);
            val.Append(',');
            continue;
        }
    }

    // remove quotation
    for (int i = 0; i < result.Count; i++)
    {
        string t = result[i];
        result[i] = t.Substring(1, t.Length - 2);
    }

    return result;
}
saku
I really disliked FileHelpers. Too much manual configuration.
Josh Close
A: 

hello tomalak,

I have just try your regular expression in my code..its work fine for formated text with quote ...

but wondering if we can parse below value by Regex..

"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"

I am looking for result as:

'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'

Thanx

No, that is impossible with my solution, because my regex depends on the quotes to delimit the values. You should really think about using one of the other solutions (e.g. use a parser).
Tomalak
If you can do anything about it, change the CSV format to something less ambiguous. Don't use commas or quotes when both can occur within the values, or at least consistently escape commas and quotes within values.
Tomalak
If you use a delimiter in your DSV that isn't going to occur in your values (or is going to occur very rarely, and you just escape it if that happens), you don't even need to worry about the quotes. Colons are pretty common as a delimiter, in my experience.
Adam Jaskiewicz
+4  A: 

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

kanja
A: 

Try CsvHelper or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

Josh Close
A: 

FileHelpers supports multiline fields.

You could parse files like these:

a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"

Here is the datatype declaration:

[DelimitedRecord(",")]
public class MyRecord
{ 
 public string field1;
 [FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
 public string field2;
}

Here is the usage:

static void Main()
{
 FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
 MyRecord[] res = engine.ReadFile("file.csv");       
}
NahuelGQ