I need to search for lines in a CSV file that end in an unterminated, double-quoted string.

For example:

1,2,a,b,"dog","rabbit

would match whereas

1,2,a,b,"dog","rabbit","cat bird"
1,2,a,b,"dog",rabbit

would not.

I have very limited experience with regular expressions, and the only thing I could think of is something like

"[^"]*$

However, that matches from the last quote to the end of the line, even when that last quote closes a properly terminated string.

How would this be done?

+4  A: 

Assuming that the strings cannot contain ", you need to match a string that has an odd number of quotes, like this:

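^([^"]*("[^"]*")?)*"[^"]*$

Note that this is vulnerable to a denial-of-service (DoS) attack through runaway backtracking.

This will match zero or more repetitions of an unquoted run optionally followed by a complete quoted string, then a single unpaired quote and the rest of the line. It is anchored at both ends so that the whole line is tested rather than any substring that happens to contain a quote.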
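If backtracking is a worry, the same odd-number-of-quotes test can be sketched without a regex at all, simply by counting quote characters. This is a swapped-in alternative rather than the pattern above; `has_dangling_quote` is a hypothetical helper name, and the sketch keeps the assumption that strings cannot contain `"`.

```python
def has_dangling_quote(line: str) -> bool:
    """Return True when the line holds an odd number of double quotes."""
    # Strip the line terminator, then test the parity of the quote count.
    return line.rstrip("\r\n").count('"') % 2 == 1


# The three sample lines from the question.
samples = [
    '1,2,a,b,"dog","rabbit',
    '1,2,a,b,"dog","rabbit","cat bird"',
    '1,2,a,b,"dog",rabbit',
]

# Keep only the lines whose last quoted string is never closed.
flagged = [s for s in samples if has_dangling_quote(s)]
print(flagged)
```

Counting runs in a single pass over each line, so there is nothing for a hostile input to blow up.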

SLaks
Why would it be vulnerable to a DDOS?
Austin Hyde
It's got nested expandos. http://msdn.microsoft.com/en-us/magazine/ff646973.aspx (The other answer is also vulnerable)
SLaks
+4  A: 

Assuming quotes can't be escaped, you need to test the parity of quotes (making sure that there's an even number of them instead of odd). Regular expressions are great for that:

^(([^"]*"){2})*[^"]*$

That will match all lines with an even number of quotes. You can invert the result for all strings with an odd number. Or you can just add another ([^"]*") part at the beginning:

^[^"]*"(([^"]*"){2})*[^"]*$
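As a rough illustration, here is how the odd-parity pattern could be applied line by line in Python. The pattern is copied from above; the helper name `unterminated_lines` and the sample data (taken from the question) are only for demonstration.

```python
import re

# One quote, then any number of quote pairs, with non-quote runs in between.
ODD_QUOTES = re.compile(r'^[^"]*"(([^"]*"){2})*[^"]*$')


def unterminated_lines(lines):
    """Yield (line_number, text) for lines with an odd number of quotes."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # drop the line terminator before matching
        if ODD_QUOTES.match(text):
            yield number, text


sample = [
    '1,2,a,b,"dog","rabbit\n',
    '1,2,a,b,"dog","rabbit","cat bird"\n',
    '1,2,a,b,"dog",rabbit\n',
]

hits = list(unterminated_lines(sample))
print(hits)
```

Only the first sample line is reported, which is the behavior the question asks for.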

Similarly, if you have access to reluctant operators instead of greedy ones you can use a simpler-looking expression:

^((.*"){2})*.*$         #even
^.*"((.*"){2})*.*$      #odd
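It is worth checking the dot-based variants before relying on them. Because `.` can itself match a quote, backtracking appears to let the "odd" pattern succeed on a line with an even number of quotes, whether the quantifiers are greedy or reluctant. A minimal Python sketch of that check follows; the `*?` spelling for reluctant quantifiers is Python's, and the variable names are illustrative.

```python
import re

# The "odd" pattern as written above (greedy), and a reluctant spelling of it.
DOT_ODD = re.compile(r'^.*"((.*"){2})*.*$')
DOT_ODD_LAZY = re.compile(r'^.*?"((.*?"){2})*.*?$')

# The character-class version from earlier in this answer, for comparison.
CLASS_ODD = re.compile(r'^[^"]*"(([^"]*"){2})*[^"]*$')

# Two quotes: a properly terminated line that should NOT be reported as odd.
balanced = '1,2,a,b,"dog",rabbit'

results = {
    "dot_greedy": bool(DOT_ODD.match(balanced)),
    "dot_lazy": bool(DOT_ODD_LAZY.match(balanced)),
    "char_class": bool(CLASS_ODD.match(balanced)),
}
print(results)
```

Both dot variants report a match on the balanced line, while the `[^"]` version does not, so the character-class form looks like the one to use for a parity test.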

Now, if quotes can be escaped, it's a different question entirely, but the approach would be similar: determine the parity of unescaped quotes.
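As a sketch of that idea, assuming quotes are escaped with a backslash (an assumption, since the question does not say how escaping works), the parity of unescaped quotes could be computed with a small scanner. The function name `odd_unescaped_quotes` and its `escape` parameter are hypothetical. Note that CSV-style doubling (`""`) adds two quotes at a time, so it does not change the parity and needs no special handling.

```python
def odd_unescaped_quotes(line: str, escape: str = "\\") -> bool:
    """Return True when the number of quotes not protected by `escape` is odd."""
    count = 0
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape:
            # Skip the escape character and whatever character it protects.
            i += 2
            continue
        if ch == '"':
            count += 1
        i += 1
    return count % 2 == 1


# One opening quote plus two escaped quotes: still unterminated.
open_line = 'a,"say \\"hi\\"'
# Same content with a closing quote added: terminated.
closed_line = 'a,"say \\"hi\\""'
print(odd_unescaped_quotes(open_line), odd_unescaped_quotes(closed_line))
```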

Welbog
Shouldn't there be some question marks in those last two regexes? But I would recommend against that approach even *with* reluctant quantifiers, for the reason @SLaks mentioned: potential runaway backtracking. Your first approach should be safe because no one part of the regex can match the same characters as a neighboring part--everything matches either a quote or a not-quote.
Alan Moore
@Alan: With respect to question marks, depends on your regex dialect. Some regex dialects use `*?` as the reluctant Kleene closure while others require you to assign flags to the regex to tell the interpreter that Kleene closures are reluctant. Others might consider them reluctant by default and need to be explicitly told to be greedy.
Welbog
I don't know of any regex flavor that treats quantifiers as reluctant by default. PHP has the `U` modifier, which makes them reluctant unless you use the question mark to make them greedy. Many people, myself among them, believe that feature was a mistake, and that users should be strongly discouraged from using it. Whatever benefit it brings is more than canceled out by the confusion it causes.
Alan Moore
A: 

To avoid "nested expandos":

egrep -v '^[^"]*("[^"]*"[^"]*)*[^"]*$' my_file
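For anyone without egrep at hand, the same inverted match can be approximated in Python. In this sketch `grep_v` is a hypothetical helper, and an in-memory `io.StringIO` stands in for `my_file` with the question's sample lines as its contents.

```python
import io
import re

# Same pattern as the egrep command: matches lines with an even number of quotes.
EVEN_QUOTES = re.compile(r'^[^"]*("[^"]*"[^"]*)*[^"]*$')


def grep_v(pattern, stream):
    """Return the lines of `stream` that `pattern` does NOT match, like grep -v."""
    kept = []
    for line in stream:
        text = line.rstrip("\n")
        if not pattern.search(text):
            kept.append(text)
    return kept


# Stand-in for my_file.
my_file = io.StringIO(
    '1,2,a,b,"dog","rabbit\n'
    '1,2,a,b,"dog","rabbit","cat bird"\n'
    '1,2,a,b,"dog",rabbit\n'
)

unterminated = grep_v(EVEN_QUOTES, my_file)
print(unterminated)
```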
DVK
That's still a nested expando (A better term would be a nested repetition).
SLaks
Ah. OK. I was reading this as "nested parenthesized stuff".
DVK
+1  A: 

Try this one:

".+[^"](,|$)

This matches a quote (anywhere in the line), followed greedily by one or more of any character, then a single character that is not a quote, immediately before a comma or the end of the line.

The intended net effect is that it will only match lines with dangling quoted strings.

I think it's even immune to 'nested expandos attacks' (we do live in a very dangerous world ...)
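A quick way to check this is to run the pattern over the three sample lines from the question. In the Python sketch below (variable names are illustrative), the unterminated line is found, but the line ending in the unquoted field `rabbit` also seems to match, because `.+` can run across the closing quote of `"dog"`.

```python
import re

# The pattern from this answer, used as a search anywhere in the line.
DANGLING = re.compile(r'".+[^"](,|$)')

samples = [
    '1,2,a,b,"dog","rabbit',              # unterminated: should match
    '1,2,a,b,"dog","rabbit","cat bird"',  # terminated: should not match
    '1,2,a,b,"dog",rabbit',               # terminated: should not match
]

# Record, in order, whether each sample line is reported by the pattern.
matched = [bool(DANGLING.search(line)) for line in samples]
print(matched)
```

The third line is a false positive here, so a parity-based pattern may be the safer choice when unquoted fields can follow quoted ones.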

Adrian Regan