Let's say I have a regular expression that works correctly to find all of the URLs in a text file:

(http://)([a-zA-Z0-9\/\.])*

If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?

+3  A: 

You could simply search for everything that matches the regular expression and replace it with an empty string, e.g. in Perl: s/(http:\/\/)([a-zA-Z0-9\/\.])*//g

This would give you everything in the original text, except those substrings that match the regular expression.
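For anyone not using Perl, here is a minimal sketch of the same idea in Python. The function name strip_urls and the sample text are made up for illustration; the pattern is the one from the question, written without the capture groups and escapes that it does not need.

```python
import re

# URL pattern from the question; '/' and '.' need no escaping inside a character class.
URL_PATTERN = re.compile(r"http://[a-zA-Z0-9/.]*")

def strip_urls(text: str) -> str:
    """Return text with every substring matching URL_PATTERN removed."""
    return URL_PATTERN.sub("", text)

sample = "See http://example.com/a for details, or http://example.org today"
print(strip_urls(sample))  # See  for details, or  today
```

Note that the whitespace around each removed URL stays in place, so you may want to collapse double spaces afterwards.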

dmcer
+1  A: 

If I understand the question correctly, you can use search and replace: wrap your expression in wildcards, then substitute back only the first and last parts.

s/^(.*)(your regex here)(.*)$/$1$3/
Rob Di Marco
That will only delete one match: the last one. And very inefficiently, too.
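A quick way to see this is to run both forms side by side. This is only a sketch in Python with a made-up sample string; the anchored pattern mirrors the s/^(.*)(your regex here)(.*)$/$1$3/ form above.

```python
import re

URL = r"http://[a-zA-Z0-9/.]*"

# Anchored, greedy form from the answer above: ^(.*)(URL)(.*)$
anchored = re.compile(r"^(.*)(" + URL + r")(.*)$")

text = "one http://a.com two http://b.org three"

# The leading greedy (.*) grabs as much as it can, so group 2 lands on the last URL.
once = anchored.sub(r"\1\3", text)

# A plain global substitution removes every match in a single pass.
all_removed = re.sub(URL, "", text)

print(once)         # one http://a.com two  three
print(all_removed)  # one  two  three
```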
Alan Moore
A: 

I'm not sure if this will work exactly as you intend, but it might help: whatever you place in the brackets [] is the set of characters to match against. If you put ^ at the start of the brackets, i.e. [^a-zA-Z0-9\/.], it will match any character except those in the brackets.

http://www.regular-expressions.info/
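One caveat worth checking before relying on this: a negated class works one character at a time, so on its own it does not return "everything except the URLs". A small Python sketch with a made-up sample string:

```python
import re

text = "Visit http://example.com/docs, then reply!"

# A negated class matches single characters outside the set, wherever they occur,
# including the ':' inside the URL, and it skips ordinary letters outside the URL.
negated = re.findall(r"[^a-zA-Z0-9/.]+", text)

print(negated)  # [' ', ':', ', ', ' ', '!']
```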

superjadex12
+1  A: 

If for some reason you need a regex-only solution, try this:

((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)

I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.

The regex is a bit of a monster, so I'll try to break it down:

(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))

The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is checked for but not consumed. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without consuming that portion.

We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)

Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.

Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.

The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
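One portability note: the lookbehind here contains a +, so it is variable-width, and not every engine accepts that. Python's re module, for instance, only allows fixed-width lookbehind. If a regex-only solution is not a hard requirement, a rough sketch of a different technique, splitting the text on the URL pattern, gives the same pieces. The name non_url_chunks and the sample text are made up for illustration.

```python
import re

# Expanded URL character set from this answer, written without redundant escapes.
URL = re.compile(r"http://[a-zA-Z0-9/.#?%]+")

def non_url_chunks(text: str) -> list[str]:
    """Return the pieces of text that lie between URL matches, skipping empty pieces."""
    return [chunk for chunk in URL.split(text) if chunk]

sample = "Intro http://example.com/a middle http://example.org/b?c end"
print(non_url_chunks(sample))  # ['Intro ', ' middle ', ' end']
```

Unlike .+?, which stops at a newline unless single-line mode is on, the split keeps multi-line chunks intact.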

This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.

Corrections welcome, of course!

WCWedin
I see you took this as the accepted answer. Thanks! I want to reiterate my warning about complex regexes, though. Document it well if you or someone else is going to be reading the code later. Also, [a-zA-Z0-9\/\.#?/%] can be changed to [a-zA-Z0-9/.#?%] -- you (usually) don't need to escape symbols inside a character group. You might also want to try https?:// instead of http://, and possibly other protocols as well, depending on your requirements. Don't forget to check out http://www.regular-expressions.info/, as superjadex12 suggested.
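Here is a small Python sketch of both suggestions, assuming an engine that treats \/ and \. inside a class as plain literals (Python's re does). It compares the two character classes over printable ASCII and then uses https?:// on a made-up sample string.

```python
import re

escaped = re.compile(r"[a-zA-Z0-9\/\.#?/%]")
plain = re.compile(r"[a-zA-Z0-9/.#?%]")

# Compare the two classes over every printable ASCII character.
printable = [chr(c) for c in range(32, 127)]
same = all(bool(escaped.fullmatch(ch)) == bool(plain.fullmatch(ch)) for ch in printable)

# The optional 's' lets the scheme match both http:// and https://.
URL = re.compile(r"https?://[a-zA-Z0-9/.#?%]+")
cleaned = URL.sub("", "a https://x.com/y b http://z.org c")

print(same)     # True
print(cleaned)  # a  b  c
```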
WCWedin