tags:

views:

614

answers:

5

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?

Not a match:

"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"

Match:

"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
A: 
".*"(\n|(".*",)*)

should work, I guess...

Aviral Dasgupta
Note that `.*` is greedy and may eat up any number of characters, including some `"`, so that for instance `"adsf", "", "asdf"`would also be a match. (And the performance may be pretty bad because it matches way to much even in the case of a non-match.)
Lucero
A: 

For single-line matches:

^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"

or for multi-line:

(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"

Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.

Lucero
+1  A: 

Try this:

(?m)""(?![ \t]*(,|$))

Explanation:

(?m)       // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
""         // match two successive double quotes
(?!        // start negative look ahead
  [ \t]*   //   zero or more spaces or tabs
  (        //   open group 1
    ,      //     match a comma 
    |      //     OR
    $      //     the end of the line or string
  )        //   close group 1
)          // stop negative look ahead

So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between".

(i) besides being the normal start-of-the-string and end-of-the-string meta characters.

Bart Kiers
Uh, won't this match ` "asdf", "" ` (should not match) but not ` "asdf""asdf", "asdf" ` (should match)?
Lucero
@Lucero: no, exactly the opposite. It does NOT match the double quotes in `"asdf", ""`, and it DOES match the double quotes in `"asdf""asdf", "asdf"`.
Bart Kiers
How would the expression change if we were to find non-successive double quotes like "asdf "something" asdf ", "asdf", ...
stevenjmyu
A: 

Try this regular expression:

"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"

That will match any quoted string with at least one pair of unescaped double quotes.

Gumbo
+1  A: 

Due to the complexity of your problem, the solution depends on the engine you are using. This because to solve it you must use look behind and look ahead and each engine is not the same one this.

My answer is using Ruby engine. The checking is just one RegEx but I out the whole code here for better explain it.

NOTE that, due to Ruby RegEx engine (or my knowledge), optional look ahead/behind is not possible. So I need a small problem of spaces before and after comma.

Here is my code:

orgTexts = [
    '"asdf","asdf"',
    '"", "asdf"',
    '"asdf", ""',
    '"adsf", "", "asdf"',
    '"asdf""asdf", "asdf"',
    '"asdf", """asdf"""',
    '"asdf", """"'
]

orgTexts.each{|orgText|
    # Preprocessing - Eliminate spaces before and after comma
    # Here is needed if you may have spaces before and after a valid comma
    orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')

    # Detect valid character (non-quote and valid quote)
    resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
    # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
    # [^\"]       ===> A non qoute
    # |           ===> or
    # ^\"         ===> beginning quot
    # |           ===> or
    # \"$         ===> endding quot
    # |           ===> or
    # (?<=,)\"    ===> quot just after comma
    # \"(?=,)     ===> quot just before comma
    # (?<=\\\\)\" ===> escaped quot

    #  This part is to show the invalid non-escaped quots
    print orgText
    print resText.gsub(Regexp.new('"'), '^')

    # This part is to determine if there is non-escaped quotes
    # Here is the actual matching, use this one if you don't want to know which quote is un-escaped
    isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")*$/) != 0).to_s
    # Basicall, it match it from start to end (^...$) there is only a valid character

    print orgText + ": " + isMatch
    print 
    print ""
    print ""
} 

When executed the code prints:

"asdf","asdf"
-------------
"asdf","asdf": false


"","asdf"
---------
"","asdf": false


"asdf",""
---------
"asdf","": false


"adsf","","asdf"
----------------
"adsf","","asdf": false


"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true


"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true


"asdf",""""
--------^^-
"asdf","""": true

I hope I give you some idea here that you can use with other engine and language.

NawaMan