views:

206

answers:

6

I've been given a large file with a funny CSV format to parse into a database.

The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".

I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.

Now, for parsing it in VBScript I was thinking of

  1. Replacing each instance of ";" with a GUID,
  2. Splitting the line into an array by semicolon,
  3. Running back through the array, replacing the GUIDs with ";"

It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...

A: 

Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.

dwelch
Yes, but the standard for CSV is to wrap the *whole* field in quotes if that field contains the separator character. Escaping the character like that is non-standard and I'm hoping my parsing method will be enough to handle it.
ChristianLinnell
+2  A: 

Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.

On approach I've used for this type of data before is to just split on the semi-colons regardless then, if two adjacent fields end and start with a quote, combine them.

For example:

Pax;is;a;good;guy";" so;says;his;wife.

becomes:

0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.

Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).

0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
paxdiablo
I imagine half the world's software is based on non-duplication of GUIDs. Plus, the fields are not intended for GUIDs so every-day non-experimental/malicious behaviour will not result in GUIDs being entered.
ChristianLinnell
+1  A: 

In pseudo-code, given:

  • input: A string, first character is input[0]; last character is input[length]. Further, assume one dummy character, input[length+1]. It can be anything except ; and ". This string is one line of the "CSV" file.

  • length: positive integer, number of characters in input

Do this:

  • set start = 0

  • if input[0] = ';':

    • you have a blank field in the beginning; do whatever with it

    • set start = 2

  • endif

  • for each c between 1 and length:

    • next iteration unless string[c] = ';'

    • if input[c-1] ≠ '"' or input[c+1] ≠ '"':                     // test for escape sequence ";"

      • found field consting of half-open range [start,c); do whatever with it. Note that in the case of empty fields, start≥c, leaving an empty range

      • set start = c+1

    • endif

  • end foreach

Untested, of course. Debugging code like this is always fun….

The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].

derobert
Oh, that's very nice.
ChristianLinnell
A: 

If you're using SQL Server you could try using T-SQL to handle everything for you.

SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0', 
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]

That will create and populate "MyTable". Read more on this subject here on SO.

Mr. Smith
A: 

I would recommend using RegEx to break up the strings.

  • Find every ';' that is not a part of ";" and change it to something else that does not appear in your fields.
  • Then go through and replace ";" with ;

Now you have your fields with the correct data.

Most importers can swap out separator characters pretty easily.

This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).

Jacob

TheJacobTaylor
A: 

One option would be to find instances of the regex:

[^"];[^"]

and then break the string apart with substring:

List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]"); 
Match m;

while((m = r.Match(line)).Success)
{
    ret.Add(line.Substring(0,m.Index + 1);
    line = line.Substring(m.Index + 2);
}

(Sorry about the C#, I don't known VBScript)

BCS