In short, I need to match all URLs in a block of text that belong to a certain domain and don't contain a specific query string parameter and value (refer=twitter).

I have the following regex to match all URLs for the domain.

\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?

I just can't get the last part (excluding refer=twitter) to work. This is my attempt:

(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?

So the following SHOULD match

example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo

But these should NOT

https://www.anotherexample.com#link
www.example.com?refer=twitter
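
A quick way to see what the attempt above actually does is to run it over these six cases. The sketch below uses Python's re module (an assumption, since the question does not say which regex engine is in use), and `matched_text` is a hypothetical helper name. It suggests the negative lookahead has no effect where it sits: it is only tested at the position where the match starts, and the characters at that position are never `?refer=twitter` or `&refer=twitter`.

```python
import re

# The attempt from the question, unchanged.
ATTEMPT = re.compile(
    r'(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?'
)

def matched_text(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

cases = [
    'example.com',
    'http://example.com/',
    'https://www.example.com#link',
    'www.example.com?somevalue=foo',
    'https://www.anotherexample.com#link',
    'www.example.com?refer=twitter',
]

for case in cases:
    print(repr(case), '->', repr(matched_text(ATTEMPT, case)))
```

In this run the last case still yields `www.example.com`, so the refer=twitter URL is not excluded. The match also stops before `?` or `#` whenever no slash follows the host, because the optional tail only starts at a `/`.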

EDIT: And if you can get it to match the

http://example.com?foo=foo.bar

out of a sentence like

For examples go to http://example.com?foo=foo.bar.

without picking up the period, that would be great!

EDIT2: Fixed the trailing period issue with this

\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?

EDIT3: This seems to work, or at least it passes 99% of the tests I've thrown at it

(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?

EDIT4: Settled on

\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*nygard\.com(?!\.)[^\s]*\b+
+1  A: 
(?!\b.*[&?]refer=twitter)

Is what you're looking for.
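
One thing worth checking with this lookahead: `.*` runs to the end of the line, so a clean URL may be rejected when a refer=twitter URL appears later on the same line. The sketch below compares that with a lookahead limited to non-whitespace characters (`\S*`), which only inspects the current URL. It assumes Python's re module; `HOST`, `LINE_WIDE` and `TOKEN_WIDE` are hypothetical names, and the tail is simplified to `\S*`, so trailing punctuation is not handled here.

```python
import re

# Simplified domain pattern (non-capturing groups so findall returns whole matches).
HOST = r'(?:https?://)?(?:[a-z0-9-]+\.)*example\.com\S*'

# Lookahead as in the answer: scans the rest of the line.
LINE_WIDE = re.compile(r'\b(?!.*[&?]refer=twitter)' + HOST)

# Variant: scans only up to the next whitespace, i.e. the current URL.
TOKEN_WIDE = re.compile(r'\b(?!\S*[&?]refer=twitter)' + HOST)

text = 'See example.com/a and www.example.com?refer=twitter today'

print(LINE_WIDE.findall(text))   # the clean URL is rejected too
print(TOKEN_WIDE.findall(text))  # only the refer=twitter URL is rejected
```

With this input the line-wide version finds nothing, while the token-limited version returns `example.com/a`. If every URL sits on its own line, the two should behave the same.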

Stefan Kendall
Chad
A: 

To be honest, at first the thought of using a regex didn't even cross my mind (which is a good sign: using a regex must, IMO, always be a secondary option, not the primary one). Here is how I'd do it in my language of choice:

>>> from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both live in the urlparse module
>>> p = urlparse('http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'refer': ['twitter'], 'rock': ['paper']}

From here you can inspect the host and the query parameters however you like.
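
As a minimal sketch of where this could go, the function below applies the two rules from the question to a single candidate URL. It is written for Python 3, where these functions live in `urllib.parse`. The name `is_wanted` and its `domain` parameter are hypothetical, and it assumes the candidate URLs have already been pulled out of the text (for instance with the simpler domain-only regex from the question).

```python
from urllib.parse import urlparse, parse_qs

def is_wanted(url, domain="example.com"):
    """True if url is on `domain` (or a subdomain) and lacks refer=twitter."""
    # urlparse only recognises the host when the URL has a scheme or starts
    # with '//', so scheme-less input like 'www.example.com' gets a '//' prefix.
    if not url.startswith(("http://", "https://")):
        url = "//" + url
    parts = urlparse(url)
    host = parts.hostname or ""
    on_domain = host == domain or host.endswith("." + domain)
    # parse_qs maps each parameter name to a list of its values.
    from_twitter = "twitter" in parse_qs(parts.query).get("refer", [])
    return on_domain and not from_twitter

candidates = [
    "example.com",
    "http://example.com/",
    "https://www.example.com#link",
    "www.example.com?somevalue=foo",
    "https://www.anotherexample.com#link",
    "www.example.com?refer=twitter",
]

for url in candidates:
    print(url, "->", is_wanted(url))
```

The first four candidates come back True and the last two False, which matches the lists in the question. The suffix check includes the leading dot so that anotherexample.com does not count as a subdomain of example.com.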

shylent
For the record, I didn't downvote you. I personally don't see why people think regexes are a last resort and dislike them. I find them to be a good solution: simpler to write and less error-prone than lots of string parsing and manipulation to complete a task. But that's just my opinion.
Chad
Well, you are contradicting yourself. First you ask a question about how you can't get some regexes to work, but then you say they are less error-prone. On the other hand, I don't see how one could possibly get two calls to standard library functions wrong (as in my example). I am not saying that regular expressions should not be learned, not at all. In fact, my attitude towards them stems from my experience (sometimes good, sometimes bad).
shylent
"Well, you are contradicting yourself. First you ask a question about how you can't get some regexes to work, but then you say they are less error-prone." That is not contradictory at all. There is one point of failure in a regex: the regex itself, and its purpose is to parse and validate strings of data. Why would I want to recreate that in several lines of code? It's much more likely that I'd write flawed code than a flawed regex.
Chad
I don't know about you, but I was a regex fanatic for quite some time. Then I started writing unit tests.
shylent
And what do unit tests have to do with using a regex or not? You can write unit tests for regexes too.
Chad