I'm looking at parsing a delimited string, something on the order of

a,b,c

But this is a very simple example, and parsing delimited data can get complex; for instance

1,"Your simple algorithm, it fails",True

would blow your naive string.Split implementation to bits. Is there anything I can freely use/steal/copy and paste that offers a relatively bulletproof solution to parsing delimited text? .NET, plox.
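To make the failure concrete, here is a small sketch of the naive approach. The class and method names (NaiveSplitDemo, Split) are made up for illustration; only string.Split is a real framework call.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical helper: splits on every comma, including commas
    // that sit inside a quoted field.
    public static string[] Split(string line)
    {
        return line.Split(',');
    }
}
```

Calling NaiveSplitDemo.Split on the quoted example above returns four pieces instead of three: the quoted field is cut in two at its embedded comma, and the quote characters stay attached to the pieces.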

Update: I decided to go with the TextFieldParser, which is part of VB.NET's pile of goodies hidden away in Microsoft.VisualBasic.DLL.

A: 

I am thinking that a generic framework would need to specify two things: 1. What the delimiting characters are. 2. Under what conditions those characters do not count (such as when they are between quotes).

I think you may be better off writing custom logic each time you need to do something like this.

Vaibhav
+2  A: 

I am not aware of any framework, but a simple state machine works:

  • State 1: Read every char until you hit a " or a ,
    • In case of a ": Move to State 2
    • In case of a ,: Move to State 3
    • In case of the end of file: Move to State 4
  • State 2: Read every char until you hit a "
    • In case of a ": Move to State 1
    • In case of the end of the file: Either Move to State 4 or signal an error because of an unterminated string
  • State 3: Add the current buffer to the output array, move the cursor forward past the , and go back to State 1.
  • State 4: this is the final state; it does nothing except return the output array.
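The states above could be sketched in C# roughly like this. The names CsvStateMachine and ParseLine are invented for the example, and it assumes a comma delimiter, a double quote as the text qualifier, and that an unterminated quote should be reported as an error (the list leaves that choice open).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvStateMachine
{
    private enum State { Unquoted, Quoted }

    // Splits one record into fields. Commas inside double quotes do not split.
    public static List<string> ParseLine(string input)
    {
        var fields = new List<string>();
        var buffer = new StringBuilder();
        var state = State.Unquoted;

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Unquoted: // State 1
                    if (c == '"')
                    {
                        state = State.Quoted;
                    }
                    else if (c == ',')
                    {
                        // State 3: emit the buffer, then carry on in State 1.
                        fields.Add(buffer.ToString());
                        buffer.Clear();
                    }
                    else
                    {
                        buffer.Append(c);
                    }
                    break;

                case State.Quoted: // State 2
                    if (c == '"')
                    {
                        state = State.Unquoted;
                    }
                    else
                    {
                        // Commas and newlines inside quotes are kept as data.
                        buffer.Append(c);
                    }
                    break;
            }
        }

        // State 4: end of input. Assumption: an open quote is an error.
        if (state == State.Quoted)
        {
            throw new FormatException("Unterminated quoted field.");
        }
        fields.Add(buffer.ToString());
        return fields;
    }
}
```

This sketch does not handle escaped quotes (a doubled "" inside a quoted field); it simply drops them, so treat it as a starting point rather than a full CSV parser.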
Michael Stum
CSV strings can include newline characters within quoted text, so you can't signal an error just because you reach the end of a line while in State 2.
ck
That's actually true, I always forget about the dreaded \n character that screws up most CSV parsers. Clarified.
Michael Stum
+1  A: 

There are some good answers here: Split a string ignoring quoted sections

You might want to rephrase your question to something more precise (e.g. What code snippet or library can I use to parse CSV data in .NET?).

Patrick McElhaney
+2  A: 

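Something like this:

    // Requires: using System.Collections.Generic; using System.Text;
    // internalLine is the single line of delimited text to split.
    var elements = new List<string>();
    var current = new StringBuilder();
    var p = 0;

    while (p < internalLine.Length)
    {
        if (internalLine[p] == '"')
        {
            // Quoted field: copy everything up to the closing quote.
            p++;
            while ((p < internalLine.Length) && (internalLine[p] != '"'))
            {
                current.Append(internalLine[p]);
                p++;
            }
            // Skip past the closing quote and the comma that follows it.
            p += 2;
        }
        else
        {
            // Unquoted field: copy everything up to the next comma.
            while ((p < internalLine.Length) && (internalLine[p] != ','))
            {
                current.Append(internalLine[p]);
                p++;
            }
            // Skip past the comma.
            p++;
        }
        elements.Add(current.ToString());
        current.Length = 0;
    }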
Stu
A: 

Simplest way is just to split the string into a char array and look for your string delimiters (the quote characters) and your split character.

It should be relatively easy to unit test.

You can wrap it in an extension method similar to the basic .Split method.

Keith
A string is inherently a char array, you don't need to do any conversion
ck
+2  A: 

I use this to read from a file

  // Requires a reference to Microsoft.VisualBasic.dll.
  string filename = textBox1.Text;
  string[] fields;
  string[] delimiter = new string[] {"|"};
  using (Microsoft.VisualBasic.FileIO.TextFieldParser parser =
             new Microsoft.VisualBasic.FileIO.TextFieldParser(filename)) {
      parser.Delimiters = delimiter;
      // Set this to true if fields may be wrapped in double quotes.
      parser.HasFieldsEnclosedInQuotes = false;
      while (!parser.EndOfData) {
          fields = parser.ReadFields();

          // Do what you need with the fields of this record.
      }
  }

I am sure someone here can transform this to parse a string that is in memory.
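A minimal sketch of that in-memory variant: TextFieldParser also has a constructor that takes a TextReader, so the string can be wrapped in a StringReader. The class and method names (InMemoryDelimitedParser, Parse) are made up for the example, and quoted fields are switched on here as an assumption so that the example from the question parses as three fields.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class InMemoryDelimitedParser
{
    // Parses delimited text held in a string; returns one string[] per record.
    public static List<string[]> Parse(string text, string delimiter)
    {
        var rows = new List<string[]>();
        using (var reader = new StringReader(text))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(delimiter);
            // Assumption: fields may be wrapped in double quotes.
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

For example, InMemoryDelimitedParser.Parse(line, ",") on the quoted line from the question should give a single record whose second field keeps its embedded comma.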

Jedi Master Spooky
Coming back to this answer, I still think it is the best. I've tried the FileHelpers and have come to the conclusion that they scare me. I don't trust a framework which relies on the order that fields are defined on a type.
Will
+1  A: 

To do a shameless plug, I've been working on a library for a while called fotelo (Formatted Text Loader) that I use to quickly parse large amounts of text based off of delimiter, position, or regex. For a quick string it is overkill, but if you're working with logs or large amounts, it may be just what you need. It works off a control file model similar to SQL*Loader (kind of the inspiration behind it).

Dillie-O
+2  A: 

A very comprehensive library can be found here: FileHelpers

rohancragg
I've tried the FileHelpers since asking this question and I really don't like the delimited parsers.
Will