Hi,

I know this has been discussed a million times. I searched the forums, found some regexes that come close, and tried to modify them, but to no avail.

Say there is a line in a csv file like this:

"123", 456, "701 "B" Street", 910

Is there an easy regex to detect "B" (since it's a non-escaped set of quotes within the normal CSV quotes) and replace it with something like \"B\"? The final string would end up looking like this:

"123", 456, "701 \"B\" Street", 910

Help would be greatly appreciated!

+3  A: 

Trust me, you don't want to do this with a regex. You want something like Java CSV Library.

fuzzy lollipop
Yes, I agree. Unfortunately, I'm a lowly developer stuck with a StreamTokenizer-based solution that I can't just scrap. It would work fine if those inner quotes were escaped, however.
@user361970 - if you have a broken solution that you need to fix, *of course* you can scrap it and do it better. Surely, we cannot be talking about more than 100 lines of code here. If your boss says otherwise, send him to SO so that we can explain to him why it is a bad idea to patch bad code.
Stephen C
StreamTokenizer is even WORSE
fuzzy lollipop
Thanks for stating the obvious....
A: 

There are a few zillion libraries to help you parse CSV, but if you want to use a regexp for academic reasons, this may help:

  • quoted string with escape support. "(\\.|[^\\"])*"
  • unquoted field: [^",]*
  • delimiter: , *

I don't use CSV files, so I'm not sure whether the unquoted-field pattern is valid (matching 456 above, for example), or whether /, */ is the delimiter you want.

At any rate, combining the above will match one field and one delimiter (or end of string):

(quotedstring|unquoted)(delimiter|$)
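
To make the combination concrete, here is a minimal Java sketch of that field-plus-delimiter pattern. The class name CsvFieldSketch and the method fields are made up for illustration, and it assumes backslash-escaped quotes as in the asker's desired output. It is a sketch under those assumptions, not a replacement for a real CSV parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one line into fields using
// (quotedstring|unquoted)(delimiter|$) from the answer above.
class CsvFieldSketch {
    // Group 1: quoted string with backslash escapes, or an unquoted run.
    // Group 2: the delimiter ", *" or the empty match at end of line.
    private static final Pattern FIELD = Pattern.compile(
        "(\"(?:\\\\.|[^\\\\\"])*\"|[^\",]*)(, *|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                // Typically an unescaped quote inside a quoted field.
                throw new IllegalArgumentException("malformed field at index " + pos);
            }
            out.add(m.group(1));
            if (m.group(2).isEmpty()) {
                break; // the delimiter group matched end of line
            }
            pos = m.end();
        }
        return out;
    }
}
```

With properly escaped input such as "123", 456, "701 \"B\" Street", 910 this yields four fields (quotes still attached). With the asker's original line it throws at the third field, which is at least a cheap way to detect the broken lines.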
Jordan Sissel
A: 

I would use a tailored sed expression such as:

's/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \4 \5 \6/g'
ring bearer
This might be the way to go in the interim.
How would I modify this to escape with \ instead of replacing with an empty string?
Simple: `'s/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \\\"\4\\" \5 \6/g'` Note that \\ prints a \ and " prints a " around \4. Hope that answers it.
ring bearer
I guess I need to take some sed lessons. I get this when testing it in Cygwin: sed: -e expression #1, char 58: invalid reference \6 on `s' command's RHS
The formatting has messed up the sed expression above. Look at my original answer and keep the regex part as is; just change \4 to \\\"\4\\"
ring bearer
A: 

Your example is not proper CSV:

"123", 456, "701 "B" Street", 910

this should actually be:

"123", 456, "701 ""B"" Street", 910

(There are plenty of variations of CSV, of course, but since most of the time people want it for use with Excel or Access, I stick to the Microsoft definition.)

Therefore the regex for this can look like:

".+("").+("").+"

The groups (in parentheses) will be your double quotes, and the rest ensures that they are found within another set of quotes.

That covers the find part of your needs. The replace part depends on what you are programming in.
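
Since the asker mentions StreamTokenizer, here is a rough Java sketch of one way to do the repair on the original, broken line. It is a heuristic and not something from this answer: it treats a quote as a field boundary only when nothing but spaces separates it from a comma or from the start or end of the line, and prefixes every other quote with an escape string. The names InnerQuoteEscaper, escapeInnerQuotes, opensField and closesField are invented for the example.

```java
// Hypothetical helper: escapes quotes that do not look like field boundaries.
class InnerQuoteEscaper {
    // Pass "\\" for backslash escaping, or "\"" to double the quote (Excel style).
    static String escapeInnerQuotes(String line, String escape) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            // Leave quotes alone if they already carry a backslash.
            boolean alreadyEscaped = i > 0 && line.charAt(i - 1) == '\\';
            if (c == '"' && !alreadyEscaped
                    && !opensField(line, i) && !closesField(line, i)) {
                out.append(escape);
            }
            out.append(c);
        }
        return out.toString();
    }

    // True if only spaces lie between this quote and the previous comma or line start.
    static boolean opensField(String s, int i) {
        int j = i - 1;
        while (j >= 0 && s.charAt(j) == ' ') j--;
        return j < 0 || s.charAt(j) == ',';
    }

    // True if only spaces lie between this quote and the next comma or line end.
    static boolean closesField(String s, int i) {
        int j = i + 1;
        while (j < s.length() && s.charAt(j) == ' ') j++;
        return j == s.length() || s.charAt(j) == ',';
    }
}
```

Calling escapeInnerQuotes(line, "\\") on the example line produces the backslash form the asker wanted, and escapeInnerQuotes(line, "\"") produces the doubled-quote form shown above. The heuristic cannot be right in general: an inner quote that happens to sit next to a comma, as in "a "b", c", is indistinguishable from a field boundary.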

Ricosuave
Not exactly. In the CSV case, you're looking for a pattern like `([^"]|"")*` : matches tokens made from non-quotes or two quotes. `.+` might match single quotes anyway, and `.+("").+("").+` assumes a too-specific format - it only allows two quotes, and `+` requires characters before, between and after them.
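
For what it's worth, that pattern can be tried out in Java roughly like this. The class and method names are invented for the example, and the surrounding quotes are added to the pattern so that it matches one whole quoted field.

```java
import java.util.regex.Pattern;

// Hypothetical check: is this a well-formed quoted field in the
// doubled-quote CSV dialect?
class QuotedFieldCheck {
    // Opening quote, then any run of non-quotes or "" pairs, then closing quote.
    private static final Pattern QUOTED =
        Pattern.compile("\"(?:[^\"]|\"\")*\"");

    static boolean isWellFormed(String field) {
        return QUOTED.matcher(field).matches();
    }
}
```

Here isWellFormed accepts "701 ""B"" Street" and rejects "701 "B" Street", because a lone quote in the middle can only be read as the closing quote.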
Kobi
Ahh...right you are. That's what I get for answering so close after dinner. I always get my plusses and asterisks confused...
Ricosuave