As stated in the title, is there a way, using regular expressions, to match a text pattern only where it appears outside of quotes? Ideally, given the following examples, I would want to match the comma that is outside of the quotes, but not the one inside the quotes.

This is some text, followed by "text, in quotes!"

or

This is some text, followed by "text, in quotes" with more "text, in quotes!"

Additionally, it would be nice if the expression respected nested quotes, as in the following example. However, if this is not technically feasible with regular expressions, it would simply be nice to know that this is the case.

The programmer looked up from his desk, "This can't be good," he exclaimed, "the system is saying 'File not found!'"

I have found some expressions for matching something that would be in the quotes, but nothing quite for something outside of the quotes.

A: 

Here is an expression that gets the match, but it isn't perfect, as the first match it gets is the whole string, removing the final ".

[^"].*(,).*[^"]

I have been using my Free RegEx tester to see what works.

Test Results

Group Match Collection # 1
Match # 1
Value: This is some text, followed by "text, in quotes!
Captures: 1

Match # 2
Value: ,
Captures: 1
Mitchel Sellers
+3  A: 

Easiest is matching both commas and quoted strings, and then filtering out the quoted strings.

/"[^"]*"|,/g
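A minimal sketch of this match-then-filter idea, assuming Python's re module. The function name unquoted_comma_positions is made up for illustration:

```python
import re

# A complete quoted string is tried first; only if that fails is a bare comma matched.
TOKEN = re.compile(r'"[^"]*"|,')

def unquoted_comma_positions(text):
    """Return the indexes of commas that are not inside double quotes."""
    # Quoted strings are matched too, then filtered out here.
    return [m.start() for m in TOKEN.finditer(text) if m.group() == ","]

sample = 'This is some text, followed by "text, in quotes!"'
print(unquoted_comma_positions(sample))
```

Note that an unclosed quote never matches the first alternative, so any commas after it would be reported as unquoted.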

If you really can't have the quotes matching, you could do something like this:

/,(?=[^"]*(?:"[^"]*"[^"]*)*\Z)/g

This could become slow, because for each comma it has to look at all the remaining characters and count the quotes. \Z matches at the end of the string. It is similar to $, but never matches at line ends inside the string.
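As a rough illustration, here is the lookahead pattern used to split on unquoted commas in Python. This assumes Python's re module, where \Z matches only at the very end of the string; split_outside_quotes is a hypothetical helper name:

```python
import re

# A comma followed by an even number of double quotes up to the end of the string.
OUTSIDE_COMMA = re.compile(r',(?=[^"]*(?:"[^"]*"[^"]*)*\Z)')

def split_outside_quotes(text):
    """Split text on commas that are not inside double quotes."""
    return OUTSIDE_COMMA.split(text)

sample = 'This is some text, followed by "text, in quotes" with more "text, in quotes!"'
print(split_outside_quotes(sample))
```

Because the pattern contains no capturing groups, split returns only the pieces between the unquoted commas.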

If you don't mind an extra capture group, it could be done like this instead:

/\G((?:[^"]*"[^"]*")*?[^"]*?)(,)/g

This will only scan the string once. It counts the quotes from the beginning of the string instead. \G matches the position where the last match ended.


The last pattern could use an example.

Input String: 'This is, some text, followed by "text, in quotes!" and more ,-as'
Matches:
1. ['This is', ',']
2. [' some text', ',']
3. [' followed by "text, in quotes!" and more ', ',']

It matches the string leading up to the comma, as well as the comma.
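Python's re module has no \G, but the same single-pass scan can be approximated by anchoring each attempt with Pattern.match(text, pos). This is only a sketch under that assumption, and scan_commas is a made-up name:

```python
import re

# The pattern from above without \G; match(text, pos) anchors each attempt at pos.
STEP = re.compile(r'((?:[^"]*"[^"]*")*?[^"]*?)(,)')

def scan_commas(text):
    """Return (leading text, comma) pairs for each comma outside double quotes."""
    results = []
    pos = 0
    while True:
        m = STEP.match(text, pos)
        if m is None:
            break
        results.append((m.group(1), m.group(2)))
        pos = m.end()  # continue where the previous match ended, like \G
    return results

sample = 'This is, some text, followed by "text, in quotes!" and more ,-as'
for prefix, comma in scan_commas(sample):
    print(repr(prefix), repr(comma))
```

For the input string above this yields three pairs; the trailing '-as' contains no comma, so the loop stops there.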

MizardX
+2  A: 

This can be done with modern regexes due to the massive number of hacks to regex engines that exist, but let me be the one to post the "Don't Do This With Regular Expressions" answer.

This is not a job for regular expressions. This is a job for a full-blown parser. As an example of something you can't do with (classical) regular expressions, consider this:

()(())(()())

No (classical) regex can determine if those parentheses are matched properly, but doing so without a regex is trivial:

/* C code */

#include <stdio.h>

int main(void)
{
  char string[] = "()(())(()())";
  int parens = 0;
  /* walk the string until the terminating NUL character */
  for(char *tmp = string; *tmp; tmp++)
  {
    if(*tmp == '(') parens++;
    if(*tmp == ')') parens--;
  }
  if(parens > 0)
  {
    printf("%d too many open parentheses.\n", parens);
  }
  else if(parens < 0)
  {
    printf("%d too many closing parentheses.\n", -parens);
  }
  else
  {
    printf("Parentheses match!\n");
  }
  return 0;
}

# Perl code

my $string = "()(())(()())";
my $parens = 0;
for(split(//, $string)) {
  $parens++ if $_ eq "(";
  $parens-- if $_ eq ")";
}
die "Too many open parentheses.\n" if $parens > 0;
die "Too many closing parentheses.\n" if $parens < 0;
print "Parentheses match!\n";

See how simple it was to write some non-regex code to do the job for you?

EDIT: Okay, back from seeing Adventureland. :) Try this (written in Perl, commented to help you understand what I'm doing if you don't know Perl):

# split $string into a list, split on the double quote character
my @temp = split(/"/, $string);

# iterate through a list of the number of elements in our list
for(0 .. $#temp) {

  # skip odd-numbered elements - only process $temp[0], $temp[2], etc.
  # the reason is that, if we split on "s, every other element is a quoted string
  next if $_ & 1;

  if($temp[$_] =~ /regex/) {
    # do stuff
  }

}
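For readers who don't use Perl, roughly the same split-and-skip idea in Python might look like this. It is a sketch only; unquoted_segments is a hypothetical name, and it assumes quotes are balanced and never escaped:

```python
def unquoted_segments(text):
    """Return the pieces of text that lie outside double quotes."""
    # After splitting on the quote character, even-indexed pieces are outside
    # quotes and odd-indexed pieces are the quoted strings.
    return text.split('"')[0::2]

sample = 'This is some text, followed by "text, in quotes" with more "text, in quotes!"'
# Count the commas that appear outside quotes.
print(sum(segment.count(",") for segment in unquoted_segments(sample)))
```

Any per-segment test (a regex search, a count, and so on) can then be applied to each returned piece.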

Another way to do it:

my $bool = 0;    # true while we are inside a quoted string
my $str = "";
my $match = 0;

# loop through the characters of a string
for(split(//, $string)) {

  if($_ eq '"') {
    $bool = !$bool;
    if($bool) {

      # a quote just opened: test the unquoted text collected so far
      $match += $str =~ /regex/;

      $str = "";
    }

    # never add the quote character itself to our test string
    next;
  }

  if(!$bool) {

    # add the current character to our test string
    $str .= $_;
  }
}

# get trailing string match
$match += $str =~ /regex/;

(I give two because, in another language, one solution may be easier to implement than the other, not just because There's More Than One Way To Do It™.)

Of course, as your problems grow in complexity, there will arise certain benefits of constructing a full-blown parser, but that's a different horse. For now, this will suffice.

Chris Lutz
It looks like this is the direction I'm going to go, but being able to do this with regular expressions (assuming it was fast) would be nice as I'm still trying to do a regex for something that is outside of the quotes.
Rob
A: 

You would be better off building yourself a simple parser (pseudo-code):

quoted := False
FOR char IN string DO
    IF char = '"'
        quoted := !quoted
    ELSE
        IF char = "," AND !quoted
            // not quoted comma found
        ENDIF
    ENDIF
ENDFOR
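
A runnable Python rendering of the pseudo-code above, offered as a sketch. The function name find_unquoted_commas and the choice to return indexes are assumptions for illustration:

```python
def find_unquoted_commas(text):
    """Return the indexes of commas that are outside double quotes."""
    positions = []
    quoted = False
    for index, char in enumerate(text):
        if char == '"':
            quoted = not quoted      # toggle state on every quote character
        elif char == "," and not quoted:
            positions.append(index)  # unquoted comma found
    return positions

print(find_unquoted_commas('This is some text, followed by "text, in quotes!"'))
```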
Gumbo
Nice Pascal code there. =P Not trying to be an ass, but I've never seen anyone use := for assignment in pseudocode.
Chris Lutz
A: 

This really depends on if you allow nested quotes or not.

In theory, with nested quotes you cannot do this (regular languages can't count).

In practice, you might manage if you can constrain the depth. It will get increasingly ugly as you add complexity. This is often how people get into grief with regular expressions (trying to match something that isn't actually regular in general).

Note that some "regex" libraries/languages have added non-regular features.

If this sort of thing gets complicated enough, you'll really have to write/generate a parser for it.

simon
For my problem in particular, the nested quotes need to be respected, but I could foresee situations where another developer might need to do this where they don't need to respect the nested quotes, thus the information might be useful for them.
Rob
+1  A: 

As mentioned before, a (classical) regexp cannot match arbitrarily nested patterns, since nested structures do not form a regular language; they need at least a context-free grammar.

So if you have any nested quotes, you are not going to solve this with a regex.
(The exception is the "balancing group" feature of the .NET regex engine, as mentioned by Daniel L in the comments, but I am not making any assumption about the regex flavor here.)

That changes if you add a further specification, such as requiring that a quote within a quote be escaped.

In that case, the following:

text before string "string with \escape quote \" still
within quote" text outside quote "within quote \" still inside" outside "
inside" final outside text

would be matched successfully with:

(?ms)((?:\\(?=")|[^"])+)(?:"((?:[^"]|(?<=\\)")+)(?<!\\)")?
  • group1: text preceding a quoted text
  • group2: text within double quotes, even if \" are present in it.
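
As a simpler alternative to the pattern above (a different technique, not the same regex), a Python sketch can tokenize quoted strings with backslash escapes and keep only the bare commas. The name unquoted_commas_with_escapes is hypothetical, and the sketch assumes a backslash escapes the next character only inside quotes:

```python
import re

# Inside quotes, a backslash consumes the following character, so \" does not
# end the string. A bare comma is matched only when no quoted string starts here.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|,', re.DOTALL)

def unquoted_commas_with_escapes(text):
    """Return indexes of commas outside double-quoted, backslash-escaped strings."""
    return [m.start() for m in TOKEN.finditer(text) if m.group() == ","]

sample = 'one, "two \\" , still quoted", three'
print(unquoted_commas_with_escapes(sample))
```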
VonC
Doesn't the .net regexp engine support matching things like balanced quotes, parens, etc?
Daniel LeCheminant
http://blog.stevenlevithan.com/archives/balancing-groups
Daniel LeCheminant
@Daniel: true, I have updated my answer accordingly
VonC
PCRE supports recursive regexes, but they're really f***ing confusing and probably still experimental. The simplest I believe is (?R), which is a recursive match of the whole pattern. Hence, matching parens would be /\((?:[^()]|(?R))*\)/ with Perl regexes. But that looks even more hideous than usual.
Chris Lutz
A: 

You need more in your description. Do you want any set of possible quoted strings and non-quoted strings like this ...

Lorem ipsum "dolor sit" amet, "consectetur adipiscing" elit.

... or simply the pattern you asked for? This is pretty close I think ...

(?<outside>.*?)(?<inside>(?=\"))

It does capture the "'s however.

JP Alioto
A: 

Maybe you could do it in two steps?
First you replace the quoted text:

("[^"]*")

and then you extract what you want from the remaining string
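
A small sketch of the two steps in Python, assuming unescaped, balanced double quotes. strip_quoted is a made-up helper, and replacing each quoted string with an empty pair of quotes is just one possible choice:

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def strip_quoted(text):
    """Step 1: replace every quoted string with an empty pair of quotes."""
    return QUOTED.sub('""', text)

sample = 'This is some text, followed by "text, in quotes!"'
remaining = strip_quoted(sample)
# Step 2: search the remaining text for whatever you need, here commas.
print(remaining, remaining.count(","))
```

Keep in mind that positions found in the stripped string no longer line up with positions in the original string.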

fredrik
A: 
,(?=(?:[^"]*"[^"]*")*[^"]*\z)

Regexes may not be able to count, but they can determine whether there's an odd or even number of something. After finding a comma, the lookahead asserts that, if there are any quotation marks ahead, there's an even number of them, meaning the comma is not inside a set of quotes.

This can be tweaked to handle escaped quotes if needed, though the original question didn't mention that. Also, if your regex flavor supports them, I would add atomic groups or possessive quantifiers to keep backtracking in check.

Alan Moore