hi all,

i'm writing a little app for spam-checking and i'm having problems with a regex.

let's say i have this spam url:

http://hosting.tyumen.ru/tip.html

so i want to check that the url has 2 full stops (subdomain + ending), a slash, a word, a full stop and "html".

here's what i got so far:

<a href="(http://.*?\..*?..*?/.*?.html)">(http://.*?\..*?..*?/.*?.html)</a>

might look like rubbish but it works - the problem: it's really slow and freezing my app.

any hints on how to optimize it? thx.

A: 
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#
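For readers working in Python, here is a minimal sketch of how this pattern might be used. The `#` delimiters look PHP-style and are dropped; the helper name `find_spam_urls` and the sample text are made up for illustration.

```python
import re

# Pattern from the answer above as a Python raw string
# (assumption: the leading and trailing # were regex delimiters).
SPAM_URL = re.compile(
    r'http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html'
)

def find_spam_urls(text):
    """Return every URL in text shaped like sub.domain.tld/word.html."""
    return SPAM_URL.findall(text)

# Hypothetical sample: only the first URL has a three-label host.
sample = (
    'see http://hosting.tyumen.ru/tip.html and '
    'http://localhost/index.html and http://example.com/page.html'
)
print(find_spam_urls(sample))
```

Because every character class excludes the dot and the slash, each part of the host can only end in one place, so the engine has very little to backtrack over.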
Mez
What happens if the host or domain name contains numbers?
Fredrik Mörk
good point - editing
Mez
+2  A: 

Since you claim to be a regexp newbie, I will offer some more general advice on creating and debugging regular expressions. When they get complicated, I find Regexp Coach a must.

It's freeware and really saves a lot of headaches. Not to mention that you don't have to build and run your application every minute just to see whether the regexp works the way you wanted.

bth
Or, if you're using Emacs, re-builder
PP
Another good one is RegexBuddy, though not freeware (but not expensive either by any stretch).
Matthew Scharley
+1 matthew, nothing like regex buddy!
David Hedlund
+2  A: 

(http://[\w.-]+/.+?\.html) - this may work, but only for your case.

or maybe a faster one:

(http://[\w.-]+/[^.]+\.html)
S.Mark
This matches almost any valid URL ending in ".html" and is slow for long documents with URLs not ending in ".html". I don't think this answers the question at all. I don't understand why it has been accepted as the best answer?
Mark Byers
Update: the second version is a little faster, but it still matches far more than was asked for, and it will still be slow for URLs without an extension (this is quite common in the real world - see stackoverflow for example).
Mark Byers
As I mentioned, it will work for his case only. Do you really think `http://[^"]+\.[^"]+\.[^"]+/[^"]+.html` is faster than `http://[\w\d.-]+/[^.]+\.html`? I really don't think so.
S.Mark
I am just improving his regex, ok? Of course it will not match on stackoverflow; his original regex already hardcodes .html.
S.Mark
Your regular expression can still be slow in many real-world situations, it matches far more than he asked for (almost every valid URL ending in .html), and it can even match two separate links like this: <a href="http://localhost/">This one is safe</a> but this is spam: <a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>. Your regex will capture both links and the intervening text in one match.
Mark Byers
That comment got mangled. There is not room to discuss it here, sorry.
Mark Byers
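The two-link case from the comment above can be reproduced in Python. This is only a sketch: the page string is copied from the comment, and the variable names are made up.

```python
import re

# First regex from the answer: a lazy .+? after the host.
lazy = re.compile(r'(http://[\w.-]+/.+?\.html)')
# Second regex from the answer: [^.]+ cannot cross a dot.
no_dot = re.compile(r'(http://[\w.-]+/[^.]+\.html)')

page = (
    '<a href="http://localhost/">This one is safe</a> but this is spam: '
    '<a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>'
)

lazy_hit = lazy.search(page).group(1)
no_dot_hit = no_dot.search(page).group(1)

print(lazy_hit)    # starts at the safe link and runs into the spam link
print(no_dot_hit)  # stays inside the spam link
```

The lazy version starts at `http://localhost/` and keeps extending until it finds the first `.html`, which is inside the second link. The `[^.]+` version fails at the first link and only matches the second URL.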
For that case, I would suggest he use an HTML parser.
S.Mark
>> can still be slow in many real world situations
Please show me a faster one then. I will be interested to benchmark it.
S.Mark
Isn't `\w\d` redundant? `\w` includes `\d` too, right?
Amarghosh
`<a href=http://google.com>http://google.com</a>` is actually possible, and `[^"]+` will even match `.html`, because it's a greedy match.
S.Mark
+5  A: 

The reason it's slow is that the non-greedy quantifier .*?, used this way, is prone to catastrophic backtracking.

Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible that isn't a double quote, which would terminate the href":

<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+.html)">\1</a>

I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.

Note that this regex will break if, say, the <a> tag has a class name, an id, or any other attribute. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
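A rough Python sketch of the same idea follows. It is not the exact regex above: it assumes the tighter class `[^/."]+` suggested in the comments below and escapes the dot before `html`. The helper name `spam_links` and the sample page are made up.

```python
import re

# Assumptions: [^/."]+ replaces [^"]+ (see the comments below),
# and the dot before "html" is escaped.
LINK = re.compile(
    r'<a href="(http://[^/."]+\.[^/."]+\.[^/."]+/[^/."]+\.html)">\1</a>'
)

def spam_links(html):
    """Return hrefs of anchors whose link text repeats the href exactly."""
    return [m.group(1) for m in LINK.finditer(html)]

# Hypothetical page with one safe link and one spam-shaped link.
page = (
    '<a href="http://localhost/">This one is safe</a> but this is spam: '
    '<a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>'
)
print(spam_links(page))
```

The back-reference `\1` means the link text is compared to the captured href as a literal string, so the URL pattern is only evaluated once per anchor.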

David Hedlund
For your subdomain and domain name matching (at least), I'd be using `[^".]+` to help even more with backtracking.
Matthew Scharley
It doesn't quite work though: it has false matches such as http://localhost/doc_v1.0.0.html. You could use [^/"] instead.
Mark Byers
+1 that's a great point, although omitting the escape would be quite fatal to that ;) `[^"\.]+`
David Hedlund
@Mark: my previous response was to Matthew, although +1 for you too; that's a different aspect of pretty much the same error. So to sum it all up: `[^/\."]+`
David Hedlund
+1 for bringing formal content to the question, not just the solution
kurast
just a note: inside [] dot . has no special meaning. [.]+ will not match everything, but only dots.
S.Mark
ah. that was news to me. so my remark to matthew was quite uncalled for then. thanks for pointing out
David Hedlund
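The point about dots inside a character class is easy to check in Python. This is a small sketch with made-up variable names:

```python
import re

# Inside [...] an unescaped dot is a literal dot, not "any character".
dots_only = re.fullmatch(r'[.]+', '...') is not None
letters = re.fullmatch(r'[.]+', 'abc') is not None

# So [^".] and [^"\.] accept and reject exactly the same characters.
same = all(
    bool(re.fullmatch(r'[^".]', ch)) == bool(re.fullmatch(r'[^"\.]', ch))
    for ch in 'a.-"/9'
)

print(dots_only, letters, same)
```

Escaping the dot inside the brackets is harmless, just not required.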
A: 

In Python, a simple way to match URLs ending in .html or .htm is to use

import re

url_re = re.compile(
    r'https?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' # domain...
    r'localhost|' # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
    r'(?::\d+)?' # optional port
    r'\S+\.html?' # path ending in .html or .htm (dot escaped)
    , re.IGNORECASE)

which is a modified version of Django's UrlField regex.

This will match any URL ending in .html or .htm, whether the host is localhost, an IP address, or a domain name.
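A short usage sketch for a pattern of this shape. The copy below escapes the dot before `html`; the helper name `looks_like_html_url` and the sample URLs are made up, and `fullmatch` is one possible way to require that the URL really ends in .html or .htm.

```python
import re

# Copy of the pattern, repeated so the example runs on its own.
url_re = re.compile(
    r'https?://'
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'
    r'localhost|'
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
    r'(?::\d+)?'
    r'\S+\.html?',
    re.IGNORECASE,
)

def looks_like_html_url(url):
    """True if the whole string is a URL ending in .html or .htm."""
    return url_re.fullmatch(url) is not None

for candidate in (
    'http://hosting.tyumen.ru/tip.html',
    'http://localhost:8000/a/b.htm',
    'http://127.0.0.1/x.html',
    'http://example.com/about',
):
    print(candidate, looks_like_html_url(candidate))
```

Note that this accepts any host, so it is much broader than the two-dot host the question asks for.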

ikkebr