Hi,

I'm a relative novice to regular expressions (although I've used them many times successfully). I want to find all links in a document that do not end in ".html". The regular expression I came up with is:

href=\"([^"]*)(?<!html)\"

In Notepad++, my editor, href=\"([^"]*)\" finds all the links (both those that end in "html" and those that do not). Why doesn't negative lookbehind work?

I've also tried lookahead:

href=\"[^"]*(?!html\")

but that didn't work either.

Can anybody help?

Cheers, grovel

+1  A: 

Edit: Notepad++ uses the SciTE regular expression engine, which does not support lookaround expressions.

For more info, take a look here: http://www.scintilla.org/SciTERegEx.html


Original Answer

^.*(?<!\.html)$

S.Mark
+1  A: 

That regular expression would work fine if you were using Perl or PCRE (e.g. preg_match in PHP). However, lookahead and lookbehind assertions are not supported by most regular expression engines, especially the simpler ones, such as the one used by Notepad++. Only the most basic syntax, such as quantifiers, subpatterns and character classes, is supported by almost all regular expression engines.
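
As a quick illustration of an engine that does accept the pattern, here is a minimal sketch in Python, whose re module supports fixed-width lookbehind. The pattern is the one from the question; the sample string and the names LINK_NOT_HTML and sample are made up for this example.

```python
import re

# The pattern from the question, minus the backslashes before the quotes,
# which are not needed inside a Python raw string.
LINK_NOT_HTML = re.compile(r'href="([^"]*)(?<!html)"')

# Hypothetical sample input, not taken from the original document.
sample = '<a href="a.html">A</a> <a href="b.php">B</a> <a href="c/">C</a>'

# With a single capture group, findall returns that group for each match.
non_html_links = LINK_NOT_HTML.findall(sample)
print(non_html_links)
```

The link ending in "html" is skipped because the lookbehind fails right before the closing quote, and no shorter match can reach a quote.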

You can find the documentation for the Notepad++ regular expression engine at: http://notepad-plus.sourceforge.net/uk/regExpList.php

Rithiur
+1  A: 

You can make a regexp that does it, but it would probably be too complex:

href=\"((([^"]*)([^h"][^"][^"][^"]|[^t"][^"][^"]|[^m"][^"]|[^l"]))|([^"]|)([^"]|)([^"]|))\"
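
A rough way to check that this lookaround-free pattern agrees with the lookbehind version is to run both in an engine that supports lookbehind. The sketch below uses Python for that; the sample links and all names (PLAIN, LOOKBEHIND, links, sample) are assumptions for illustration. The last alternative is written as [^l"] here so that it can never consume a quote.

```python
import re

# Lookaround-free pattern: a link passes if any of its last four characters
# differs from "h", "t", "m", "l" respectively, or if it has at most 3 characters.
PLAIN = re.compile(
    r'href="((([^"]*)([^h"][^"][^"][^"]|[^t"][^"][^"]|[^m"][^"]|[^l"]))'
    r'|([^"]|)([^"]|)([^"]|))"'
)

# Lookbehind pattern from the question, for comparison.
LOOKBEHIND = re.compile(r'href="([^"]*)(?<!html)"')


def links(pattern, text):
    """Return the link target (capture group 1) of every match."""
    return [m.group(1) for m in pattern.finditer(text)]


# Hypothetical sample covering long, short and empty link targets.
sample = (
    '<a href="a.html">A</a> <a href="b.php">B</a> <a href="">E</a> '
    '<a href="tml">T</a> <a href="xhtml">X</a> <a href="index.htm">I</a>'
)

plain_result = links(PLAIN, sample)
lookbehind_result = links(LOOKBEHIND, sample)
print(plain_result == lookbehind_result)
```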
jpalecek
A: 

Thank you all very much.

In the end, the regular expression did indeed not work.

I simply used a workaround: I replaced all links with themselves plus ".html", then replaced all occurrences of ".html.html" with ".html".

So I replaced href=\"([^"]*)\" with href="\1.html", and then .html.html with .html.
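
For anyone who wants to script the same two-step workaround outside the editor, here is a minimal sketch in Python. The function name append_html_suffix and the sample string are made up for illustration; the two replacements are the ones described above.

```python
import re


def append_html_suffix(text: str) -> str:
    """Make every href target end in ".html" using two plain replacements."""
    # Step 1: append ".html" to every link target, even ones that already have it.
    doubled = re.sub(r'href="([^"]*)"', r'href="\1.html"', text)
    # Step 2: collapse the doubled suffix that step 1 created.
    return doubled.replace(".html.html", ".html")


# Hypothetical sample input.
sample = '<a href="a.html">A</a> <a href="b">B</a>'
fixed = append_html_suffix(sample)
print(fixed)
```

Note that step 2 is a plain text replacement, so it would also collapse any ".html.html" that was already in the document before step 1.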

Thanks anyway, grovel

grovel