ansaurus

Question

Regex: how to eliminate URLs ending with .dtd

Answer 1

+3 A:

The nicest way to do it is to use a negative lookbehind (in languages that support them):

/(?>http:\/\/[^\s]*)(?<!\.dtd)/g

The ?> at the start of the first group makes it an atomic group, which stops the regex engine from backtracking into it. So it'll match the full URL as it does now, and if/when the next part fails it won't try going back and matching less.

The (?<!\.dtd) is a negative lookbehind, which only matches if \.dtd doesn't match ending at that position (i.e., the URL doesn't end in .dtd).

For languages that don't support lookbehinds (such as JavaScript), you can use a negative lookahead instead, which is a bit uglier and generally less efficient:

/(http:\/\/(?![^\s]*\.dtd\b)[^\s]*)/g

This matches http://, then scans ahead to make sure the URL doesn't end in .dtd, then returns to that position and scans forward again to get the actual match.
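As a quick check of the lookahead version, here is a minimal sketch that runs it against a made-up sample string. The example.com / example.org URLs and the variable names are illustrative assumptions, not taken from the question.

```javascript
// Lookahead pattern from the answer above.
const urlNotDtd = /(http:\/\/(?![^\s]*\.dtd\b)[^\s]*)/g;

// Hypothetical input text, invented for this example.
const sampleText =
  'See http://example.com/page.html for details, ' +
  'the schema at http://example.com/schema/strict.dtd ' +
  'and the feed at http://example.org/feed.xml today.';

// match() with the g flag returns every full match, or null if there are none.
const lookaheadMatches = sampleText.match(urlNotDtd) || [];

// Logs the two URLs that do not end in .dtd; the strict.dtd URL is skipped.
console.log(lookaheadMatches);
```

Because the lookahead scans the whole run of non-whitespace characters, a URL with .dtd followed by punctuation (for example a trailing full stop) is rejected as well.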

As always, http://www.regular-expressions.info/ is a good reference for more information.

Chris Smith 2010-03-31 12:55:36

Getting a syntax error because of the `<` in (?<!\.dtd)

Nadal 2010-03-31 13:02:36

Hmm. It's possible JavaScript doesn't support lookbehinds then. In that case, I can't think of a nice way you can do it with a single regexp - your best bet is just to use what you have now, loop through the results and manually remove any that end in ".dtd".

Chris Smith 2010-03-31 13:06:43
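The fallback described in the comment above (match every URL, then drop the ones ending in ".dtd") might look roughly like this. The asker's original pattern is not shown in the thread, so `/http:\/\/[^\s]*/g` is assumed here, and `urlsWithoutDtd` is a hypothetical helper name.

```javascript
// Hypothetical helper: collect all URLs first, then filter in plain code.
function urlsWithoutDtd(text) {
  // Assumed "match everything" pattern; the asker's actual regex is not in the thread.
  const allUrls = text.match(/http:\/\/[^\s]*/g) || [];
  // Keep only the URLs that do not end in ".dtd".
  return allUrls.filter((url) => !url.endsWith('.dtd'));
}

// Made-up input for illustration.
const filteredUrls = urlsWithoutDtd(
  'DTD: http://example.com/strict.dtd page: http://example.com/index.html'
);
console.log(filteredUrls);
```

This keeps the regex simple and needs neither lookbehind nor lookahead, at the cost of a second pass over the results.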

I know JavaScript does not support look behind. However it does support positive and negative lookahead.

Nadal 2010-03-31 13:10:57

Or maybe there is: `/(http:\/\/(?![^\s]*\.dtd\b)[^\s]*)/g`. It's not as nice or efficient as the look-behind one, but seems to do the trick.

Chris Smith 2010-03-31 13:11:46

perfect. nicely done. thanks.

Nadal 2010-03-31 13:18:26
