ansaurus

Question

Answer 1

A:

#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#

Mez 2009-11-25 11:32:38

What happens if the host or domain name contains numbers?

Fredrik Mörk 2009-11-25 11:35:22

good point - editing

Mez 2009-11-25 11:48:57

Answer 2

+2 A:

Since you claim to be a regexp newbie, I will offer some more general advice on creating and debugging regular expressions. When they get complicated, I find The Regex Coach a must.

It's freeware and saves a lot of headaches. Not to mention that you don't have to build and run your application every minute just to see whether the regexp works the way you wanted.

bth 2009-11-25 11:34:40

Or, if you're using Emacs, re-builder

PP 2009-11-25 11:38:01

Another good one is RegexBuddy, though not freeware (but not expensive either by any stretch).

Matthew Scharley 2009-11-25 11:42:29

+1 matthew, nothing like regex buddy!

David Hedlund 2009-11-25 12:18:21

Answer 3

+2 A:

(http://[\w.-]+/.+?\.html) - this may work, but only for your case.

or perhaps this faster one:

(http://[\w.-]+/[^.]+\.html)
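To see how the two patterns differ in practice, here is a minimal Python sketch. The sample HTML and the names lazy_re, negated_re and html are made up for illustration; they are not part of the original question.

```python
import re

# The two patterns from this answer.
lazy_re = re.compile(r'(http://[\w.-]+/.+?\.html)')
negated_re = re.compile(r'(http://[\w.-]+/[^.]+\.html)')

# Hypothetical input: one link without ".html", followed by one with it.
html = (
    '<a href="http://localhost/">This one is safe</a> but this is spam: '
    '<a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>'
)

# The lazy version starts at the first "http://" and keeps expanding
# until it finally reaches a ".html", even across tag boundaries.
lazy_match = lazy_re.search(html).group(1)

# [^.]+ cannot cross a dot, so the first URL is rejected and only the
# second one is matched.
negated_match = negated_re.search(html).group(1)

print(lazy_match)
print(negated_match)
```

Neither pattern knows anything about HTML attributes, so both are best treated as quick fixes for well-behaved input.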

S.Mark 2009-11-25 11:35:35

This matches almost any valid URL ending in ".html" and is slow for long documents with URLs not ending in ".html". I don't think this answers the question at all. I don't understand why it has been accepted as the best answer?

Mark Byers 2009-11-25 11:57:15

Update: the second version is a little faster, but it still matches far more than was asked for, and it will still be slow for URLs without an extension (this is quite common in the real world - see stackoverflow for example).

Mark Byers 2009-11-25 11:58:56

As I mentioned, it will only work for his case. Do you really think `http://[^"]+\.[^"]+\.[^"]+/[^"]+.html` is faster than `http://[\w\d.-]+/[^.]+\.html`? I really don't think so.

S.Mark 2009-11-25 12:00:42

I am just improving his regex, OK? Of course it will not match on stackoverflow; his original regex was already hardcoded to match only .html.

S.Mark 2009-11-25 12:12:57

Your regular expression can still be slow in many real-world situations, it matches far more than what he asked for (almost every valid URL ending in .html), and it can even match two separate links like this: <a href="http://localhost/">This one is safe</a> but this is spam: <a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>. Your regex will capture both links and the intervening text in one match.

Mark Byers 2009-11-25 12:15:00

That comment got mangled. There is not room to discuss it here, sorry.

Mark Byers 2009-11-25 12:16:05

For that case, I would suggest he use an HTML parser.

S.Mark 2009-11-25 12:20:59

>> can still be slow in many real world situations
Please show me a faster one then. I would be interested to benchmark it.

S.Mark 2009-11-25 12:22:53

Isn't `\w\d` redundant? `\w` includes `\d` too, right?

Amarghosh 2009-11-25 12:23:13

`<a href=http://google.com>http://google.com</a>` is actually possible, and `[^"]+` will even match `.html` itself, because it is a greedy match.

S.Mark 2009-11-25 12:27:08

Answer 4

+5 A:

The reason it's slow is that non-greedy quantifiers such as .*?, used this way, are prone to catastrophic backtracking.

Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible that isn't a double quote, since a double quote would terminate the href":

<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+.html)">\1</a>

I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.

Note that this regex will break if, say, the <a> tag has a class name, an id, or any other attribute. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
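As a rough illustration, the pattern can be tried out in Python as sketched below. The names link_re, strict_re, find_links and the sample strings are hypothetical. Escaping the final dot and using [^/."] for the host parts are assumptions layered on top of the pattern above, the latter following the suggestions in the comments.

```python
import re

# Pattern from this answer, with the last dot escaped (an assumption
# about the intent; the original leaves it unescaped).
link_re = re.compile(
    r'<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+\.html)">\1</a>'
)

# Tighter variant: host parts may not contain '/', '.' or '"'.
strict_re = re.compile(
    r'<a href="(http://[^/."]+\.[^/."]+\.[^/."]+/[^"]+\.html)">\1</a>'
)


def find_links(pattern, text):
    """Return the captured URL of every link the pattern matches."""
    return [m.group(1) for m in pattern.finditer(text)]


sample = (
    '<a href="http://localhost/">This one is safe</a> but this is spam: '
    '<a href="http://foo.bar.com/baz.html">http://foo.bar.com/baz.html</a>'
)
with_class = '<a href="http://a.b.c/x.html" class="k">http://a.b.c/x.html</a>'

print(find_links(link_re, sample))
print(find_links(strict_re, sample))
print(find_links(link_re, with_class))  # extra attribute: no match
```

The back-reference \1 means the link text has to repeat the href exactly, so a link whose visible text differs from its target is skipped.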

David Hedlund 2009-11-25 11:40:42

For your subdomain and domain name matching (at least), I'd be using `[^".]+` to help even more with backtracking.

Matthew Scharley 2009-11-25 11:45:30

It doesn't quite work though: it has false matches such as http://localhost/doc_v1.0.0.html. You could use [^/"] instead.

Mark Byers 2009-11-25 11:49:41

+1 that's a great point, although omitting the escape would be quite fatal to that ;) `[^"\.]+`

David Hedlund 2009-11-25 11:49:44

@Mark: my previous response was to Matthew, although +1 for you too; that's a different aspect of pretty much the same error. So to sum it all up: `[^/\."]+`

David Hedlund 2009-11-25 11:51:36

+1 for bringing formal content to the question, not just the solution

kurast 2009-11-25 11:52:46

Just a note: inside [] the dot . has no special meaning. [.]+ will not match everything, only dots.
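A quick way to check this in Python (the variable names and the test strings are made up for illustration):

```python
import re

# Inside a character class the dot is literal, escaped or not.
dots_only = re.findall(r'[.]+', 'a..b.c')
escaped = re.findall(r'[\.]+', 'a..b.c')

# So [^/."]+ and [^/\."]+ behave the same: both stop at '/', '.' and '"'.
parts_plain = re.findall(r'[^/."]+', 'foo.bar/baz')
parts_escaped = re.findall(r'[^/\."]+', 'foo.bar/baz')

print(dots_only, escaped)
print(parts_plain, parts_escaped)
```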

S.Mark 2009-11-25 12:11:42

ah. that was news to me. so my remark to matthew was quite uncalled for then. thanks for pointing out

David Hedlund 2009-11-25 12:17:39

Answer 5

A:

In Python, a simple way to match URLs ending in .html or .htm is to use

import re

url_re = re.compile(
    r'https?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' # domain...
    r'localhost|' # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:\S+\.html?)+' # path ending in .html or .htm (dot escaped)
    , re.IGNORECASE)

which is a modified version of Django's URLField regex.

This will match any URL ending in .html or .htm, whether the host is localhost, an IP address, or a domain name.
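A minimal usage sketch follows. The test URLs and the helper name is_html_url are made up for illustration, and the pattern is repeated here, with the dot before html escaped, so that the snippet runs on its own.

```python
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain
    r'localhost|'  # localhost
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # or ip
    r'(?::\d+)?'  # optional port
    r'(?:\S+\.html?)+',  # path ending in .html or .htm
    re.IGNORECASE,
)


def is_html_url(candidate):
    """Hypothetical helper: True if candidate starts with a matching URL."""
    return url_re.match(candidate) is not None


print(is_html_url('http://example.com/index.html'))
print(is_html_url('http://localhost:8000/a/b.htm'))
print(is_html_url('http://example.com/index.php'))
```

Note that match only anchors the start of the string; to pull URLs out of a larger document, url_re.finditer would be the usual choice.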

ikkebr 2009-11-25 11:49:02

Regex Problem (newbie)