This is one of the toughest things I have ever tried to do. Over the years I have searched, but I just can't find a way to do this: match a string that is not surrounded by a given character, such as quotes or less-than/greater-than symbols.

A regex like this could match URLs not in HTML links, SQL table.column values not in quotes, and lots of other things.

Example with quotes: 
Match [THIS] and "something with [NOT THIS] followed by" or even [THIS].

Example with <,>, & " 
Match [URL] and <a href="[NOT URL]">or [NOT URL]</a>

Example with single quotes: 
WHERE [THIS] LIKE '%[NOT THIS]'

Basically, how do you match a string (THIS) when it is not surrounded by a given char?

\b(?:[^"'])([^"']+)(?:[^"'])\b

Here is a test pattern. A regex like the one I am thinking of would match only the first "quote".

To quote; "quote me not least I quote you!".

+2  A: 

It is a bit tough. There are ways, as long as you don't need to keep track of nesting. For instance, let's avoid quoted stuff:

^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS

Or, explaining:

^     Match from the beginning
(     Store everything from the beginning in group 1, if I want to do replace
    (?:  Non-grouping aggregation, just so I can repeat it
        [^"\\]  Anything but quote or escape character
        |       or...
        \\.     Any escaped character (ie, \", for example)
        |       or...
        "       A quote, followed by...
        (?:     ...another non-grouping aggregation, of...
            [^"\\]  Anything but quote or escape character
            |       or...
            \\.     Any escaped character
        )*      ...as many times as possible, followed by...
        "       A (closing) quote
    )*?  As many as necessary, but as few as possible
)     And this is the end of group 1
THIS  Followed by THIS
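To try the pattern above, here is a minimal sketch in Python. The function name `first_unquoted_this` is made up for illustration, and the sketch assumes balanced double quotes with backslash escapes, exactly as the pattern does:

```python
import re

# Same pattern as above; THIS is only a placeholder for the real target.
OUTSIDE_QUOTES = re.compile(r'^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS')

def first_unquoted_this(text):
    """Return the offset of the first THIS outside double quotes, or None."""
    m = OUTSIDE_QUOTES.match(text)
    # Group 1 holds everything before the match, so its end is the offset.
    return m.end(1) if m else None

sample = 'say "NOT THIS" then THIS'
offset = first_unquoted_this(sample)
print(offset, sample[offset:])
```

Because the pattern is anchored at the start, it only finds the first occurrence; to find later ones you would have to search again from the end of the previous match.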

Now, there are other ways of doing this, but they are perhaps not as flexible. For instance, if you want to find THIS as long as there wasn't a preceding "//" or "#" sequence (in other words, a THIS outside a comment), you could do it like this:

(?<!(?:#|//).*)THIS

Here, (?<!...) is a negative look-behind. It won't match these characters, but it will test that they do not appear before THIS.
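Not every engine accepts a variable-length look-behind like this one; Python's `re` module, for example, requires look-behinds to be fixed-width. A rough equivalent, sketched under the assumption that comments run to the end of the line and that "#" or "//" never appear inside string literals, is to cut each line at the comment marker and search what is left. The name `find_outside_comments` is hypothetical:

```python
import re

COMMENT_START = re.compile(r'#|//')

def find_outside_comments(text, target='THIS'):
    """Return (line number, column) pairs for target outside comments."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = COMMENT_START.search(line)
        # Keep only the part of the line before the comment marker.
        code = line[:m.start()] if m else line
        for hit in re.finditer(re.escape(target), code):
            hits.append((lineno, hit.start()))
    return hits

print(find_outside_comments('THIS = 1  # not THIS\n// THIS\nx = THIS'))
```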

As for any arbitrarily nested structures -- n ( closed by n ), for example -- they can't be represented by regular expressions. Perl can do it, but it's not a regular expression.

Daniel
It's possible if *n* is finite (and practical if *n* is small), but not if the nesting can be arbitrarily deep.
Cirno de Bergerac
That's pedantic, but so be it. Fixed.
Daniel
+1  A: 

Well, regular expressions are just the wrong tool for this, so it is quite natural that it is hard.

Things "surrounded" by other things are not valid rules for regular grammars. Most (one could perhaps say, all serious) markup and programming languages are not regular. As long as there is no nesting involved, you may be able to simulate a parser with a regex, but be sure to understand what you are doing.

For HTML/XML, just use an HTML or XML parser, respectively; these exist for almost any language or web framework, and using them typically involves just a few lines of code. For tables, you might be able to use a CSV parser, or, at a pinch, roll your own parser that extracts the parts inside/outside quotes. After extracting the parts you are interested in, you can use simple string comparison or regular expressions to get your results.
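As a rough illustration of the parser route for the URL case from the question, here is a sketch using Python's standard `html.parser` module. The class and function names are made up; it simply collects the text that is not inside an `<a>` element, so attribute values such as href never reach the text at all:

```python
from html.parser import HTMLParser

class TextOutsideLinks(HTMLParser):
    """Collect text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.link_depth == 0:
            self.chunks.append(data)

def text_outside_links(html):
    parser = TextOutsideLinks()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

html = 'See http://a.example and <a href="http://b.example">http://c.example</a>!'
print(text_outside_links(html))
```

A plain URL regex can then be run over the returned text without worrying about links.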

Svante
+1 just what I was going to point out. Basically this problem is hard just like drilling a hole with a hammer is hard.
cletus
"Surrounded" is quite a valid rule for regular languages. Nesting and unnesting, that is not.
Daniel
@Daniel: Valid rules in (right) regular grammars are only those rules that have exactly one non-terminal on the left hand side, and either the empty string, or a terminal, or a terminal followed by a nonterminal on the right hand side.
Svante
Q ::= "S; S ::= aE; S ::= bE; ...; S ::= zE; E ::= " -- there you have, a lowercase letter surrounded by quotes, in a regular grammar. Was there anything else?
Daniel
This is what I meant with "simulating a parser".
Svante
+2  A: 

The best solution will depend on what you know about the input. For example, if you're looking for things that aren't enclosed in double-quotes, does that mean double-quotes will always be properly balanced? Can they be escaped with backslashes, or by enclosing them in single-quotes?

Assuming the simplest case--no nesting, no escaping--you could use a lookahead like this:

preg_match('/THIS(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/', $string)

After finding the target (THIS), the lookahead basically counts the double-quotes after that point until the end of the string. If there's an odd number of them, the match must have occurred inside a pair of double-quotes, so it's not valid (the lookahead fails).
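The same quote-counting idea can be sketched in Python. Python's `re` only gained possessive quantifiers in version 3.11, so this version uses ordinary greedy ones; that should give the same matches here, just with more backtracking on failure. The function name is hypothetical, and the sketch assumes balanced, unescaped double quotes:

```python
import re

# THIS, followed by an even number of double quotes up to the end of input.
EVEN_QUOTES_AHEAD = re.compile(r'THIS(?=(?:(?:[^"]*"){2})*[^"]*\Z)')

def unquoted_positions(text):
    """Return the start offsets of every THIS that is outside double quotes."""
    return [m.start() for m in EVEN_QUOTES_AHEAD.finditer(text)]

print(unquoted_positions('Match THIS and "NOT THIS" but THIS.'))  # -> [6, 30]
```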

As you've discovered, this problem is not well suited to regular expressions; that's why all of the proposed solutions depend on features that aren't found in real regular expressions, like capturing groups, lookarounds, reluctant and possessive quantifiers. I wouldn't even try this without possessive quantifiers or atomic groups.

EDIT: To expand this solution to account for double-quotes that can be escaped with backslashes, you just need to replace the parts of the regex that match "anything that's not a double-quote":

[^"]

with "anything that's not a quote or a backslash, or a backslash followed by anything":

(?:[^"\\]|\\.)

Since backslash-escape sequences are relatively rare, it's worthwhile to match as many unescaped characters as you can while you're in that part of the regex:

(?:[^"\\]++|\\.)

Putting it all together, the regex becomes:

'/THIS\d+(?=(?:(?:(?:[^"\\]++|\\.)*+"){2})*+(?:[^"\\]++|\\.)*+$)/'

Applied to your test string:

'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" ' +
'but \"THIS6\" is good and \\\\"NOT THIS7\\\\".'

...it should match 'THIS1', 'THIS3', 'THIS4' and 'THIS6'.
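For anyone who wants to check that result outside PHP, here is a sketch in Python. It is an adaptation, not the regex above verbatim: without possessive quantifiers the inner `[^"\\]++` would become a nested quantifier that can backtrack badly, so this version matches one character at a time instead. The test string is written as a raw string holding the same characters the quoted literal above would produce:

```python
import re

ESCAPE_AWARE = re.compile(
    r'THIS\d+(?=(?:(?:(?:[^"\\]|\\.)*"){2})*(?:[^"\\]|\\.)*\Z)'
)

def unquoted_targets(text):
    """Return every THISn outside double quotes, honoring backslash escapes."""
    return ESCAPE_AWARE.findall(text)

sample = (r'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" '
          r'but \"THIS6\" is good and \\"NOT THIS7\\".')
print(unquoted_targets(sample))  # -> ['THIS1', 'THIS3', 'THIS4', 'THIS6']
```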

Alan Moore
This was a great start on the subject, but I'm afraid that it only looks for "THIS" when it is three quotes (") away from the end of the string.
Xeoncross
Oops! I left out a set of parentheses. Try it now.
Alan Moore
Very impressive. With support for an escape char this might be enough! preg_match_all('/[^"]+(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/', $string, $matches);
Xeoncross
Bravo. You built a regex that can handle escaped chars. I must admit that because of this I dare say that you are the best regex guy I have ever met. Over the years others have just laughed and shaken their heads when faced with this problem.
Xeoncross
A: 

After thinking about nested elements ("a "this and "this"") and backslashed items "\"THIS\"", it seems that it really is true that this isn't a job for regex. However, the only thing that I can think of to solve this problem would be a regex-like, char-by-char parser that would set $quote_level = ###; when it finds and enters a valid quote or sub-quote. That way, while in that part of the string, you would know whether you were inside any given character, even if it is escaped by a backslash or whatever.

I guess with a char-by-char parser like this you could mark the string position of start/end quotes so that you could break up the string by quote segments and only process those outside the quotes.
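A minimal sketch of such a char-by-char parser, in Python, might look like the following. It only handles one flat level of quoting plus backslash escapes (it does not attempt the nested levels discussed here), and the function name and defaults are made up for illustration:

```python
def outside_quotes(text, quote='"', escape='\\'):
    """Split text into the segments that lie outside quoted sections."""
    segments = []
    buf = []
    inside = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # An escaped character never opens or closes a quote.
            if not inside:
                buf.append(text[i:i + 2])
            i += 2
            continue
        if ch == quote:
            if not inside and buf:
                # Entering a quote: flush what was collected outside.
                segments.append(''.join(buf))
                buf = []
            inside = not inside
        elif not inside:
            buf.append(ch)
        i += 1
    if buf:
        segments.append(''.join(buf))
    return segments

print(outside_quotes('Match THIS and "NOT THIS" but THIS.'))
```

Each returned segment can then be searched with a simple regex or string comparison, since everything inside quotes has already been dropped.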

Here is an example of how this parser would need to be smart enough to handle nested levels.

Match THIS and "NOT THIS" but THIS and "NOT "THIS" or NOT THIS" but \"THIS\" is good.

//Parser "greedy" looking for nested levels
Match THIS and "
      NOT THIS"
       but THIS and "
         NOT "
          THIS"
           or NOT THIS"
             but \"THIS\" is good

//Parser "ungreedy" trying to close nested levels
Match THIS and "     " but THIS and " " THIS "   " but \"THIS\" is good.
       NOT THIS    NOT     or NOT THIS


//Parser closing levels correctly.
Match THIS and "     " but THIS and "     " but \"THIS\" is good.
       NOT THIS    NOT " " or NOT THIS
              THIS
Xeoncross
+1  A: 

See Text::Balanced for Perl and the Perl FAQ.

Sinan Ünür
That looks like what I might be looking for - but I was hoping for something I could use in PHP...
Xeoncross
A: 

As Alan M pointed out, you can use a regex to check whether an odd number of quotes follows, which tells you whether your position is inside or outside any given string. Taking the quotes example, we seem really close to a solution to this problem. The only thing left is to handle escaped quotes. (I'm positive that handling nested quotes is almost impossible.)

$string = 'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" but \"THIS6\" is good and \\\\"NOT THIS7\\\\".';


preg_match_all('/[^"]+(?=(?:(?:(?:[^"\\\]++|\\\.)*+"){2})*+(?:[^"\\\]++|\\\.)*+$)/', $string, $matches);

Array (
        [0] => Match THIS1 and 
        [1] =>  but THIS3 and 
        [2] => THIS4
        [3] =>  but 
        [4] => THIS6
        [5] =>  is good and \\
        [6] => NOT THIS7\
        [7] => .
    )
Xeoncross
I've expanded my answer to deal with escaped quotes.
Alan Moore