ansaurus

Question

Regular Expression to match both relative and absolute URLs

Answer 1

+1 A:

(http:/)?(/[\w.]+)+/?

matches these, but maybe you had stricter conditions in mind?
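
As a quick check of what this pattern does and does not cover, here is a minimal JavaScript sketch. The sample strings and the helper name `firstMatch` are illustrative assumptions, not taken from the question.

```javascript
// The pattern above, with "/" escaped for a JavaScript regex literal.
const pattern = /(http:\/)?(\/[\w.]+)+\/?/;

// Hypothetical helper: return the first matched substring, or null.
function firstMatch(text) {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

console.log(firstMatch("see http://www.foo.com/foo/bar here")); // absolute URL
console.log(firstMatch("img /foo/bar/ba090z.gif end"));          // relative URL
console.log(firstMatch("/my-page")); // stops at the hyphen, because "-" is not in [\w.]
```

The last call shows the main limitation: any character outside `[\w.]` ends the match early.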

Alex Martelli 2009-06-15 22:33:26

Answer 2

+1 A:

(http:\/)?(\/[\w\.\-]+)+\/?

Similar to Alex's.

Johnny G 2009-06-15 22:37:57

hmmm, apart from your more thorough escaping, it seems to me that you're matching 'http:/' all on its own -- you sure you want that * instead of my + there?
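
To illustrate the difference between the two quantifiers, here is a small JavaScript sketch. The `*` variant is an assumption about how the answer read before it was edited (the earlier text is not shown in this thread), and the helper name and sample strings are hypothetical.

```javascript
// Assumed earlier form: the path group uses "*", so every part of the pattern is optional.
const starPattern = /(http:\/)?(\/[\w.\-]+)*\/?/;
// Form shown in the answer: "+" requires at least one "/segment".
const plusPattern = /(http:\/)?(\/[\w.\-]+)+\/?/;

// Hypothetical helper: return the first matched substring, or null.
function firstMatch(pattern, text) {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

console.log(JSON.stringify(firstMatch(starPattern, "http:/")));      // the bare prefix is accepted
console.log(JSON.stringify(firstMatch(starPattern, "no url here"))); // an empty match at index 0
console.log(firstMatch(plusPattern, "http:/"));                      // null, a segment is required
console.log(firstMatch(plusPattern, "http://foo-bar.com/a-b"));      // full URL, hyphens included
```

With `*`, the pattern can also match the empty string at any position, which is one plausible way to end up with a large number of useless matches.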

Alex Martelli 2009-06-15 22:44:07

This one seems far, far too greedy. I ended up with about 500k of whitespace.

FlySwat 2009-06-15 22:44:33

@Alex Good call. Edited.

Johnny G 2009-06-15 23:46:15

@FlySwat With your updated purpose in mind, I can see how it would be way too greedy.

Johnny G 2009-06-15 23:51:07

Answer 3

+1 A:

That's a tricky one because there are so many valid characters in URLs (before they get URL-encoded).

Here's my shot:

(http:/|https:/)?(/[^\s"'<>]+)+/?

Also similar to Alex's. The only problem I found with Alex's is that it wouldn't match things like pound signs and dashes, whereas mine will match all of that.

EDIT -- In fact the only thing that keeps it from being too greedy is the instruction to NOT match whitespace, quotes, apostrophes, or chevrons.
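
A minimal JavaScript sketch of how the excluded characters bound the match. The `href="..."` wrapper around the sample string and the helper name `firstMatch` are assumptions made for illustration.

```javascript
// The pattern above, with "/" escaped for a JavaScript regex literal.
const pattern = /(http:\/|https:\/)?(\/[^\s"'<>]+)+\/?/;

// Hypothetical helper: return the first matched substring, or null.
function firstMatch(text) {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

// The match stops at the closing quote instead of running into the markup after it.
console.log(firstMatch('href="http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd"><html'));

// Dashes and pound signs are allowed, since only whitespace, quotes, and chevrons are excluded.
console.log(firstMatch("/my-page#top"));
```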

Steve Wortham 2009-06-15 22:42:53

Too Greedy: http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd"><html

FlySwat 2009-06-15 22:47:19

Oh... gotcha. I just updated it. Perhaps that's better.

Steve Wortham 2009-06-15 22:52:19

Answer 4

+1 A:

(
  ((http|https|ftp)://([\w-\d]+\.)+[\w-\d]+){0,1}  // Capture domain names or IP addresses
  (/[\w~,;\-\./?%&+#=]*)                // Capture paths, including relative
)

Rationale for this answer:

1. The whole thing is grouped so you can pick out the entire URL.
2. The protocol portion is optional, but if provided, a hostname or IP address should also be provided (both of which have fewer allowed characters than the rest of the URI).
3. The "/" at the beginning is also optional. Paths can be in the form "images/1.gif", which are relative to the current path rather than relative to the hostname.

Caveats:

mailto and file URIs not supported.
URLs trailed by a period (such as at the end of a sentence without quotation) will include the trailing period.
Because of #3 above, it's going to capture all sorts of things. If you can verify that no paths are relative, you can add a "/" outside the parentheses and thus require it.
If all URIs are within HTML attributes (A, LINK, IMG, etc.), you can target the URIs much more accurately by only capturing within quotes, or at least only within HTML tags.

Edit: whoops, fixed closing paren problem.
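
For readers who want to try this in JavaScript, here is a hedged sketch with the comments and layout whitespace removed. It assumes that `[\w-\d]` was meant as "word characters or hyphen" and writes it as `[\w-]`, since some regex engines reject a range that starts or ends with a class such as `\w`. The function name `findUrls` and the sample sentence are hypothetical.

```javascript
// Sketch of the pattern above as a single JavaScript regex literal.
// Assumption: [\w-\d] is rewritten as [\w-] (word characters or a hyphen).
const urlPattern = /(((http|https|ftp):\/\/([\w-]+\.)+[\w-]+)?(\/[\w~,;\-.\/?%&+#=]*))/g;

// Hypothetical helper: return every full match in the text as an array of strings.
function findUrls(text) {
  return text.match(urlPattern) || [];
}

// The second result keeps the sentence-ending period, as noted in the caveats.
console.log(findUrls("See http://www.foo.com/foo/bar and /images/1.gif."));
```

In this form the path group requires a leading "/", so a path such as "images/1.gif" on its own would only match from its first slash.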

richardtallent 2009-06-15 22:53:37

Check your expression; even after removing the comments, it won't compile.

FlySwat 2009-06-15 23:09:54

Steve Wortham 2009-06-15 23:52:39

Answer 5

+1 A:

This is not easy, and you may end up catching "too much URI", but what about:

((http://|https://)([^/])+)*(/([^\s])*(/))(((\w)*\.[\w]{3,10})|(\w+))?

Basically, there are a few groups here: one defines the protocol, one looks for the directory, and one looks for a file at the end. But this approach is very limited. If you need real URI validation and separation (port, username, password, filtering out unwanted characters), you will probably end up with a much more complex expression. Good luck!

Update:

You didn't ask for this, but for those coming from search engines who want to learn more about regex, I would like to plug the free program I used for this attempt: "The Regex Coach" (no, I'm not affiliated).

merkuro 2009-06-15 22:53:51

This seems to get the closest to actually matching what I want; however, what is "captured" is typically just "/" and not the whole URI.

FlySwat 2009-06-15 23:09:23

Because even www.google.com, cool.com or www.test.us/test could be seen as a valid URI from a certain perspective, I suggest the following expression: ([^\s]+[/.]+[^\s]+) It will not match ordinary sentences, as long as each sentence-ending period is followed by whitespace.
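
A small JavaScript sketch of that looser expression. The sample sentence and the function name `findCandidates` are made up for illustration.

```javascript
// The looser expression: non-space run, then one or more "/" or ".", then another non-space run.
const loosePattern = /([^\s]+[\/.]+[^\s]+)/g;

// Hypothetical helper: return every candidate token in the text.
function findCandidates(text) {
  return text.match(loosePattern) || [];
}

// "today." and "Done." are skipped because nothing but whitespace (or the end) follows the period.
console.log(findCandidates("Try www.google.com or cool.com or www.test.us/test today. Done."));
```

Note that adjacent punctuation, such as a comma directly after a host name, would be captured as part of the token.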

merkuro 2009-06-16 00:17:21

Answer 6

A:

I used named capture groups. We get better matches when the scheme is present; without it, something like www.foo.com/bar would only match /bar.

(?:
  (?:(?<scheme>https?|file)://)
  (?<host>[^/]+)
  (?<path>/(?:[^\s])+)?
)
|
(?<path>/(?:[^\s])+)

This is what you could do in JavaScript:

var result = text.match(/(?:(?:(https?|file):\/\/)([^\/]+)(\/(?:[^\s])+)?)|(\/(?:[^\s])+)/g);

Test data

sadfasdf /foo/bar/ba090z.gif asdfasdf /foo/bar/ sadfasdf asdflkj; http://www.foo.com/foo/bar some stuff http://user:[email protected]:80/r?stuff%20stuff

user:[email protected]:80/r?stuff%20stuff
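
If named groups are available in your JavaScript engine (ES2018 or later), a sketch along these lines keeps the names. Two changes are assumptions of this sketch rather than part of the pattern above: the second alternative's group is renamed `relPath`, because many JavaScript engines reject duplicate group names, and the host class also excludes whitespace so that it cannot run past the end of a URL with no path. The function name `findUrls` is hypothetical.

```javascript
// Named-group variant for JavaScript (ES2018+).
// Assumptions: relPath replaces the duplicate "path" name, and host excludes whitespace.
const urlPattern =
  /(?:(?<scheme>https?|file):\/\/(?<host>[^\/\s]+)(?<path>\/\S+)?)|(?<relPath>\/\S+)/g;

// Hypothetical helper: return one record per URL found in the text.
function findUrls(text) {
  return Array.from(text.matchAll(urlPattern), (m) => ({
    url: m[0],
    scheme: m.groups.scheme ?? null,
    host: m.groups.host ?? null,
    path: m.groups.path ?? m.groups.relPath ?? null,
  }));
}

const sample =
  "sadfasdf /foo/bar/ba090z.gif asdfasdf /foo/bar/ sadfasdf asdflkj; http://www.foo.com/foo/bar some stuff";
for (const u of findUrls(sample)) console.log(u);
```

Relative URLs come back with `scheme` and `host` set to null, which makes it easy to tell the two kinds apart after matching.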

TJ 2009-06-16 03:30:50
