ansaurus

Question

Answer 1

+3 A:

Trust me, you don't want to do this with regex. You want something like the Java CSV Library.

fuzzy lollipop 2010-06-09 02:06:38

Yes, I agree. Unfortunately, I'm a lowly developer using a StreamTokenizer-based solution that I can't just scrap. It would work fine if those inner quotes were escaped, however.

2010-06-09 02:26:45

@user361970 - if you have a broken solution that you need to fix, *of course* you can scrap it and do it better. Surely, we cannot be talking about more than 100 lines of code here. If your boss says otherwise, send him to SO so that we can explain to him why it is a bad idea to patch bad code.

Stephen C 2010-06-09 04:16:06

StreamTokenizer is even WORSE

fuzzy lollipop 2010-06-09 13:12:12

Thanks for stating the obvious....

2010-06-09 14:07:07
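
To illustrate the library route recommended in this answer, here is a minimal sketch using Python's standard `csv` module as a stand-in (it is not the Java library the answer refers to). It assumes the inner quotes have already been doubled, which is what standard CSV quoting expects; `parse_line` is a hypothetical helper name.

```python
import csv


def parse_line(line):
    """Parse one CSV line whose inner quotes are doubled ("" inside quotes)."""
    # skipinitialspace=True ignores the blank after each comma, so a quote
    # that follows ", " is still recognised as the start of a quoted field.
    reader = csv.reader([line], skipinitialspace=True)
    return next(reader)


if __name__ == "__main__":
    print(parse_line('"123", 456, "701 ""B"" Street", 910'))
```

The parser strips the enclosing quotes and collapses each doubled quote to a single one, so no hand-written regex is needed once the input is well formed.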

Answer 2

A:

There are a few zillion libraries to help you parse CSV, but if you want to use a regexp for academic reasons, this may help:

quoted string with escape support: "(\\.|[^\\"])*"
unquoted field: [^",]*
delimiter: , *

I don't use CSV files, so I'm not sure whether the unquoted-field pattern is valid for them (it matches 456 in the example above, for instance), or whether /, */ is the delimiter you want.

At any rate, combining the above will match one field and one delimiter (or end of string):

(quotedstring|unquoted)(delimiter|$)

Jordan Sissel 2010-06-09 02:14:40
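
The three patterns from the answer above can be combined roughly as follows. This is a sketch in Python rather than the asker's Java, and it assumes the inner quotes are backslash-escaped; `split_fields` is a hypothetical helper name, and the inner group is made non-capturing so that group 1 is the whole field.

```python
import re

QUOTED = r'"(?:\\.|[^\\"])*"'   # quoted string with backslash-escape support
UNQUOTED = r'[^",]*'            # unquoted field
DELIMITER = r', *'              # comma followed by optional spaces

# One field followed by a delimiter or the end of the line.
FIELD = re.compile(rf'({QUOTED}|{UNQUOTED})({DELIMITER}|$)')


def split_fields(line):
    """Return the raw fields of one line, or raise ValueError if it is malformed."""
    fields = []
    pos = 0
    while True:
        match = FIELD.match(line, pos)
        if match is None:
            raise ValueError(f"cannot parse a field at offset {pos}")
        fields.append(match.group(1))
        if match.group(2) == "":
            # The second group matched '$', so this was the last field.
            return fields
        pos = match.end()


if __name__ == "__main__":
    print(split_fields(r'"123", 456, "701 \"B\" Street", 910'))
```

With unescaped inner quotes, as in the question's sample line, the third field cannot be matched and the function raises `ValueError`, which is the failure the asker describes.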

Answer 3

A:

I would use a tailored sed expression such as:

's/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \4 \5,\6/g'

ring bearer 2010-06-09 02:32:45

This might be the way to go in the interim.

2010-06-09 03:11:34

How would I modify this to escape the quotes with \ instead of replacing them with an empty string?

2010-06-09 03:27:24

Simple: `'s/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \\\"\4\\" \5,\6/g'` Note that \\ prints a \ and the " that follows prints a ", so \4 ends up surrounded by \". Hope that answers it.

ring bearer 2010-06-09 03:57:23

I guess I need to take some sed lessons. I get this when testing it in Cygwin: sed: -e expression #1, char 58: invalid reference \6 on `s' command's RHS

2010-06-09 14:58:01

Formatting has messed up the sed expression above. Look at my original answer and keep the regex part as is; just change \4 to \\\"\4\\"

ring bearer 2010-06-09 15:48:19
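
The escaping variant of the sed substitution discussed in this thread can be sketched in Python as below. It is an illustration, not the poster's exact command: it assumes the same very specific line shape (four comma-separated fields with one quoted word inside the third), keeps the comma before the last field so the field count is preserved, and adds no extra space before the escaped quote. `escape_inner_quotes` is a hypothetical name.

```python
import re

# Same shape as the sed pattern: four comma-separated fields, with one
# quoted word followed by a space inside the third field.
PATTERN = re.compile(r'(.*),(.*),(.*)"(.*)" (.*),(.*)')

# In a re.sub template, \\ emits one backslash, so \\" emits \" in the output.
REPLACEMENT = r'\1,\2,\3\\"\4\\" \5,\6'


def escape_inner_quotes(line):
    """Backslash-escape the quoted word nested inside the third field."""
    return PATTERN.sub(REPLACEMENT, line)


if __name__ == "__main__":
    print(escape_inner_quotes('"123", 456, "701 "B" Street", 910'))
```

Lines that do not fit the pattern are returned unchanged, so this only works as a stopgap for input that always has this layout.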

Answer 4

A:

Your example is not proper CSV:

"123", 456, "701 "B" Street", 910

this should actually be:

"123", 456, "701 ""B"" Street", 910

(There are plenty of variations of CSV, of course, but since most of the time people want it for use with Excel or Access, I stick to the Microsoft definition.)

Therefore the regex for this can look like:

".+("").+("").+"

The groups (in parentheses) will be your double quotes, and the rest ensures that they are found within another set of quotes.

That covers the find part of your needs. The replace part depends on what you are programming in.

Ricosuave 2010-06-09 02:44:54

Not exactly. In the CSV case, you're looking for a pattern like `([^"]|"")*`, which matches tokens made from non-quotes or doubled quotes. `.+` can match a lone quote anyway, and `.+("").+("").+` assumes too specific a format: it allows exactly two doubled quotes, and `+` requires characters before, between and after them.

Kobi 2010-06-09 04:32:23

Ahh...right you are. That's what I get for answering so close after dinner. I always get my plusses and asterisks confused...

Ricosuave 2010-06-09 06:20:35
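
The pattern suggested in the comment above can be used to check and unescape a single quoted field, roughly like this. It is a sketch that assumes the doubled-quote convention and wraps the comment's pattern in the enclosing quotes; `unquote` is a hypothetical helper name.

```python
import re

# Enclosing quotes around any run of non-quote characters or doubled quotes.
QUOTED_FIELD = re.compile(r'"((?:[^"]|"")*)"')


def unquote(field):
    """Return the content of a quoted CSV field with "" collapsed to "."""
    match = QUOTED_FIELD.fullmatch(field)
    if match is None:
        raise ValueError("not a well-formed quoted field: " + field)
    return match.group(1).replace('""', '"')


if __name__ == "__main__":
    print(unquote('"701 ""B"" Street"'))
```

A field with a lone inner quote, such as the one in the question, fails the full match and is rejected instead of being silently accepted.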

Regex to match CSV file nested quotes
