ansaurus

Question

Funny CSV format help

Answer 1

A:

Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.

dwelch 2009-08-18 04:45:08

Yes, but the standard for CSV is to wrap the *whole* field in quotes if that field contains the separator character. Escaping the character like that is non-standard and I'm hoping my parsing method will be enough to handle it.

ChristianLinnell 2009-08-18 04:47:01

Answer 2

+2 A:

Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.

On approach I've used for this type of data before is to just split on the semi-colons regardless then, if two adjacent fields end and start with a quote, combine them.

For example:

Pax;is;a;good;guy";" so;says;his;wife.

becomes:

0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.

Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).

0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.

paxdiablo 2009-08-18 04:45:46

I imagine half the world's software is based on non-duplication of GUIDs. Plus, the fields are not intended for GUIDs so every-day non-experimental/malicious behaviour will not result in GUIDs being entered.

ChristianLinnell 2009-08-18 04:49:39

Answer 3

+1 A:

In pseudo-code, given:

input: A string, first character is input[0]; last character is input[length]. Further, assume one dummy character, input[length+1]. It can be anything except ; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input

Do this:

set start = 0
if input[0] = ';':
- you have a blank field in the beginning; do whatever with it
- set start = 2
endif
for each c between 1 and length:
- next iteration unless string[c] = ';'
- if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
  - found field consting of half-open range [start,c); do whatever with it. Note that in the case of empty fields, start≥c, leaving an empty range
  - set start = c+1
- endif
end foreach

Untested, of course. Debugging code like this is always fun….

The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].

derobert 2009-08-18 04:49:04

Oh, that's very nice.

ChristianLinnell 2009-08-18 04:52:21

Answer 4

A:

If you're using SQL Server you could try using T-SQL to handle everything for you.

SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0', 
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]

That will create and populate "MyTable". Read more on this subject here on SO.

Mr. Smith 2009-08-18 04:51:51

Answer 5

A:

I would recommend using RegEx to break up the strings.

Find every ';' that is not a part of ";" and change it to something else that does not appear in your fields.
Then go through and replace ";" with ;

Now you have your fields with the correct data.

Most importers can swap out separator characters pretty easily.

This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).

Jacob

TheJacobTaylor 2009-08-18 04:59:44

Answer 6

A:

One option would be to find instances of the regex:

[^"];[^"]

and then break the string apart with substring:

List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]"); 
Match m;

while((m = r.Match(line)).Success)
{
    ret.Add(line.Substring(0,m.Index + 1);
    line = line.Substring(m.Index + 2);
}

(Sorry about the C#, I don't known VBScript)

BCS 2009-08-18 16:46:48

ansaurus

tags:

views:

answers:

Funny CSV format help

related questions