In my home directory I have a folder drupal-6.14 that contains the Drupal platform.

From this directory I use the following command:

find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz

This command creates a gzipped tar archive of the drupal-6.14 folder, excluding all subfolders of drupal-6.14/sites/ except sites/all and sites/default, which it includes.

My question is about the regular expression:
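To see which paths survive the filter, the grep step can be approximated in Python, whose `re` module supports the same lookahead syntax. This is only a minimal sketch: the sample paths (including `example.com`) and the `keep` helper are made up for illustration, and `re.search` stands in for grep's match-anywhere behaviour.

```python
import re

# Same pattern as the grep -P step (unescaped dots kept as in the original).
PATTERN = re.compile(r'drupal-6.14/(?!sites(?!/all|/default)).*')

def keep(paths):
    """Return the paths the grep step would pass on to tar (hypothetical helper)."""
    return [p for p in paths if PATTERN.search(p)]

# Hypothetical sample paths, not taken from a real Drupal tree.
samples = [
    'drupal-6.14/index.php',
    'drupal-6.14/sites/all/modules/README.txt',
    'drupal-6.14/sites/default/settings.php',
    'drupal-6.14/sites/example.com/settings.php',
]

for path in keep(samples):
    print(path)
```

With these samples, only the path under sites/example.com is dropped; the other three are printed.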

grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'

The expression works to exclude all the folders I want excluded, but I don't quite understand why.

A common task with regular expressions is to:

Match all strings except those that contain subpattern x; in other words, to negate a subpattern.

I think I understand that the general strategy for solving these problems is to use negative lookaheads, but I've never understood to a satisfactory level how positive and negative look(ahead/behind)s work.

Over the years, I've read many websites on them: the PHP and Python regex manuals, other pages like http://www.regular-expressions.info/lookaround.html, and so forth, but I've never really had a solid understanding of them.

Could someone explain how this works, and perhaps provide some similar examples?

-- Update One:

Regarding Andomar's response: can a double negative lookahead be expressed more succinctly as a single positive lookahead?

i.e., is:

'drupal-6.14/(?!sites(?!/all|/default)).*'

equivalent to:

'drupal-6.14/(?=sites(?:/all|/default)).*'

???

-- Update Two:

As per @andomar and @alan moore, you can't replace a double negative lookahead with a positive lookahead.
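The difference is easy to check on a few made-up paths. In this sketch (the sample paths are hypothetical, and Python's `re` module is used in place of grep -P), the positive-lookahead version only matches paths under sites/all or sites/default, so it would drop everything else in drupal-6.14, while the nested negative version keeps those files too.

```python
import re

double_negative = re.compile(r'drupal-6.14/(?!sites(?!/all|/default)).*')
positive = re.compile(r'drupal-6.14/(?=sites(?:/all|/default)).*')

# Hypothetical sample paths, for illustration only.
samples = [
    'drupal-6.14/index.php',
    'drupal-6.14/sites/all/modules/README.txt',
    'drupal-6.14/sites/default/settings.php',
    'drupal-6.14/sites/example.com/settings.php',
]

# For each path: (matched by double negative?, matched by positive?)
results = {
    path: (bool(double_negative.search(path)), bool(positive.search(path)))
    for path in samples
}

for path, (neg, pos) in results.items():
    print(f'{path}: double negative={neg}, positive={pos}')
```

The two patterns agree on the three sites/ paths but disagree on index.php, which only the double negative form matches.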

A: 

Lookarounds can be nested.

So this regex matches "drupal-6.14/" that is not followed by "sites" that is not followed by "/all" or "/default".

Confusing? In other words, it matches "drupal-6.14/" that is not followed by "sites", unless that "sites" is in turn followed by "/all" or "/default".

ʞɔıu
Thanks for this. And *yes*, I do still find it confusing, LOL. I think your phrasing, "not followed by sites *unless* followed by all|default", is quite helpful.
themesandmodules
+4  A: 

A negative lookahead says that, at this position, the following regex must not match.

Let's take a simplified example:

a(?!b(?!c))

a      Match: (?!b) succeeds
ac     Match: (?!b) succeeds
ab     No match: (?!b(?!c)) fails
abe    No match: (?!b(?!c)) fails
abc    Match: (?!b(?!c)) succeeds

The last example is a double negation: it allows a b followed by c. The nested negative lookahead effectively becomes a positive lookahead: the c must be present.

In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.
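The table above can be reproduced with a short script. This is just a sketch using Python's `re` module (an assumption; any engine with lookahead support should behave the same way), with a hypothetical `matched_text` helper that also shows that only the a is consumed.

```python
import re

pattern = re.compile(r'a(?!b(?!c))')

def matched_text(s):
    """Return the text consumed by the match, or None if there is no match."""
    m = pattern.search(s)
    return m.group(0) if m else None

for s in ['a', 'ac', 'ab', 'abe', 'abc']:
    print(s, '->', matched_text(s))
```

The strings a, ac and abc each yield the match 'a', while ab and abe yield None.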

Andomar
If a nested negative lookahead ("double negative lookahead") can become a positive lookahead, is it possible to state an equivalent in positive lookahead form? i.e.: (a) What would be the positive lookahead form of my double negative lookahead drupal "'drupal-6.14/(?!sites(?!/all|/default)).*'" example? Would it be: 'drupal-6.14/(?=sites/all|default).* ??? (b) What would be the positive lookahead form of your double negative lookahead "(!?b(?!c))" example?
themesandmodules
Eww, sorry. First time using comments here; that formatting is horrible. I'll restate by editing the question.
themesandmodules
@willieseabrook: I don't think so. Only part of the lookahead is a double negative, so you can't replace the whole thing with a positive one.
Andomar
Just FYI, there were some typos in the regexes; negative lookaheads are always `(?!...)`, not `(!?...)`.
Alan Moore