tags:

views:

2532

answers:

5

I have the following regex that does a great job matching urls:

((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)`

However, it does not handle urls without a prefix, ie. stackoverflow.com or www.google.com do not match. Anyone know how I can modify this regex to not care if there is a prefix or not?


EDIT: Does my question too vague? Does it need more details?


(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:#@%/;$()~_?\+-=\\\.&]*)

I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.

My implementation of this is I have a database that I manage the contents, and it has a field that either has plain text, a phone number, a URL or an email address. I was looking for an easy way to validate the input so I can have it properly formatted, ie. creating anchor tags for the url/email, and formatting the phone number how I have the other numbers formatted throughout the site. Any suggestions?

A: 

Just use:

.*

i.e. match everything.

The things you want to match are just hostnames, not URL (technically).

There's no structure you can use to definitively identify hostnames. Perhaps you could look for things that end in ".com" but then you'll miss any .co.uk, net, .org, etc.

Edit:

In other words: If you remove the requirement that the URL-like things start with a protocol you won't have any thing to match on. Depending on what you are using the regular expression on:

  1. Treat everything as a URL
  2. Keep the requirement for a protocol
  3. Hack checks for common endings for hostnames (e.g. .com .net .org) and accept you'll miss some.
Douglas Leeder
are you saying to replace the contents of the square brackets with .*?
Anders
no replace the entire regular expression. Or better just remove the regular expression and treat everything as a url.
Douglas Leeder
A: 

Your regexp matches everything starting with one of those protocols, including a lot of things that cannot possibly be existent URLs, if you relax the protocol part (making it optional with ?) then you'll just be matching almost everything, including the empty string.

In other words, it does a great job matching URLs because it matches almost anything starting with http://,https://,ftp:// and so on. Well, it also matches ftp:\\ and ms-help://, but let's ignore that.

It may make sense, depending on actual usage, because the other regexp approach of whitelisting valid domains becomes non maintainable quickly enough, but making the protocol part optional does not make sense.

An example (with the relaxed protocol part in place):

>>> r = re.compile('(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:#@%/;$()~_?\+-=\\\.&]*)')
>>> r.search('oompaloompa_is_not_an_ur%&%%l').groups()[0]
'oompaloompa_is_not_an_ur%&%%l' #Matches!
>>> r.search('oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk').groups()[0]
'oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk' #Matches!
>>>

Given your edit I suggest you either make the user select what is he adding, adding an enum column, or create a simpler regex that'll check for at least a dot, besides the valid characters and maybe some common domains.

A third alternative which will be VERY SLOW and only to be used when URL validation is REALLY REALLY IMPORTANT is actually accessing the URL and do a HEAD request on it, if you get a host not found or an error you know it's not valid. For emails you could try and see if the MX host exists and has port 25 open. If both fails, it'll be plain text. (I'm not suggesting this either)

Vinko Vrsalovic
A: 

You can surround the prefix part in brackets and match 0 or 1 occurrences

(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+)?

So the whole regex will become

(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+)?[\w\d:#@%/;$()~?+-=\.&])

The problem with that is it's going to match more or less any word. For example "test" would also be a match.

Where are you going to use that regex? Are you trying to validate a hostname or are you trying to find hostnames inside a paragraph?

marto
I updated my post with my intent for this code.
Anders
A: 

The below regex is from the wonderful Mastering Regular Expressions book. If you are not familiar with the free spacing/comments mode, I suggest you get familiar with it.

\b
# Match the leading part (proto://hostname, or just hostname)
(
    # ftp://, http://, or https:// leading part
    (ftp|https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with our more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | name\b
        | coop\b
        | aero\b
        | museum\b
        | [a-z][a-z]\b # two-letter country codes
    )
)

# Allow an optional port number
( : \d+ )?

# The rest of the URL is optional, and begins with / . . . 
(
     /
     # The rest are heuristics for what seems to work well
     [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
     (?:
        [.!,?]+  [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
     )*
)?

To explain this regex briefly (for a full explanation get the book) - URLs have one or more dot separated parts ending with either a limited list of final bits, or a two letter country code (.uk .fr ...). In addition the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of the parts. Then there may be a port number, and then the rest of it.

To extract this from the website, go to http://regex.info/listing.cgi?ed=3&amp;p=207 It is from page 207 of the 3rd edition.

And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.

Hamish Downer
A: 

If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:

scheme ':' schemepart

where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.

That said, /users/ don't care if they've given you a url, to them it looks like one. So here's what I do:

BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.

frob  ->  http://frob

(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But, spec compliance is probably not what you want to 'validate'. Therefore a dns query on the hostname portion may be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.

caskey