ansaurus

Question

Regex to match URL

Answer 1

+5 A:

You should try one of the well-tested expressions on RegexLib.com rather than rolling your own, unless your requirement is unusual. URI/domain name matching is quite a common requirement.

Cerebrus 2009-07-17 07:36:19

Answer 2

A:

Using a single regexp to match a URL string makes the code incredibly unreadable. I'd suggest using parse_url to split the URL into its components (which is not a trivial task) and then checking each part with a regexp.
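
As a rough illustration of the split-then-check idea, here is a minimal sketch in Python rather than PHP: urllib.parse.urlsplit stands in for parse_url. The three per-component patterns and the function name looks_like_http_url are assumptions made for this example, not part of the answer.

```python
import re
from urllib.parse import urlsplit

# Hypothetical per-component patterns; tighten them to your own requirements.
SCHEME_RE = re.compile(r"^https?$", re.IGNORECASE)
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)
PATH_RE = re.compile(r"^(?:/\S*)?$")


def looks_like_http_url(url: str) -> bool:
    """Split the URL first, then validate each component separately."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return bool(
        SCHEME_RE.match(parts.scheme)
        and HOST_RE.match(host)
        and PATH_RE.match(parts.path)
    )
```

Each pattern stays short enough to read on its own, which is the point of splitting first. Note that a bare "example.com" is rejected here because urlsplit finds no scheme in it.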

Bluehorn 2009-07-17 07:39:15

Answer 3

A:

@Bluehorn (I can't comment yet): That's the eternal battle with regexes.

I think they only make the code unreadable to those who don't understand them. All a regex needs is a comment so you can remember what it matches. If you write your own code to parse a URL, you'll end up with a lot of code that also needs to be well commented. And, depending on the programmer, the resulting code can be worse than a complex regex.

But that's my opinion and not what the OP wants to know.

clinisbut 2009-07-17 07:45:08

I am not against using regexps for this. I am against writing it yourself instead of using standard features of e.g. PHP. Reinventing the wheel will only cause headaches. Anyway, I think I misunderstood the question, since he is not trying to match URLs but is looking for valid domain names in text.

Bluehorn 2009-07-17 07:52:30

Then I agree with you that you shouldn't reinvent the wheel with regexes ;)

clinisbut 2009-07-17 07:54:03

Answer 4

A:

Changing the end of the regex to (/\S*)?)$ should solve your problem.

To explain what that does:

it looks for a / followed by any number of non-whitespace characters
this match is optional; the ? means 0 or 1 times
and finally it must be followed by the end of the string (or change $ to \b to match on a word boundary).
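
The effect of the optional path group and the end anchor can be tried out with a small Python sketch. The HOST pattern below is a simplified, hypothetical stand-in for the host part of the question's regex, and path_of is a helper name invented for this example.

```python
import re

# Hypothetical stand-in for the host part of the question's regex.
HOST = r"[a-z0-9-]+(?:\.[a-z0-9-]+)+"

# Optional path: "/" followed by any run of non-whitespace, then end of string.
url_re = re.compile(rf"^({HOST}(/\S*)?)$", re.IGNORECASE)


def path_of(candidate):
    """Return the matched path ('' if absent), or None if nothing matches."""
    m = url_re.match(candidate)
    return None if m is None else (m.group(2) or "")
```

Because of the trailing $, anything left over after the path (such as a space and more text) makes the whole match fail instead of being silently ignored.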

benophobia 2009-07-17 07:45:32

Answer 5

A:

$ : The dollar sign signifies the end of the string.
For example, \d+$ will match strings which end with a digit. So you need to add the $!
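
A quick Python sketch of the end anchor, with two helper names made up for this example. It also shows a subtlety worth knowing: \d+$ requires at least one trailing digit, whereas \d*$ accepts zero digits and therefore matches at the end of any string.

```python
import re


def ends_with_digit(text: str) -> bool:
    # \d+ needs at least one digit directly before the end of the string.
    return re.search(r"\d+$", text) is not None


def star_variant_matches(text: str) -> bool:
    # \d* also accepts zero digits, so it matches at the end of any string.
    return re.search(r"\d*$", text) is not None
```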

Matthieu 2009-07-17 07:51:47

Answer 6

+1 A:

// Brace delimiters are used because # appears unescaped in the (?#...) comments.
$search  = "{^((?#
    the scheme:
  )(?:https?://)(?#
    second level domains and beyond:
  )(?:[\S]+\.)+((?#
    top level domains:
  )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
  )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
  )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
  )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
  )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
  )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
  )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
  )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
  )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
  )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
  )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$}i";

Just cleaned it up a bit. This will match only HTTP(S) addresses with the http:// or https:// scheme declared and, as long as you copied all top-level domains correctly from IANA, only standardized TLDs (so it will not match http://localhost).

Finally, the pattern ends with the path part, which, if present, always starts with a /.

However, I'd suggest following Cerebrus's advice: if you're not sure about this, learn regexps in a more gentle way and use proven patterns for complicated tasks.

Cheers,

By the way: Your regexp will also match something.r and something.h (between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.

On re-reading the question: Change

  )(?:https?://)(?#

to

  )(?:https?://)?(?#

(note the extra ?) to match 'URLs' without the scheme.
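
To see what the extra ? changes, here is a small Python sketch (Python rather than PHP, since the regex itself is portable). The TLD list is deliberately abbreviated to a handful of illustrative entries instead of the full IANA alternation, and build_pattern is a helper name invented for this example.

```python
import re

# Abbreviated, illustrative TLD list; the full IANA alternation is omitted here.
TLDS = r"(?:COM|NET|ORG|INFO|DE|UK)"


def build_pattern(scheme_required: bool):
    """Compile the URL pattern with a mandatory or an optional scheme."""
    scheme = r"(?:https?://)" + ("" if scheme_required else "?")
    return re.compile(
        rf"^{scheme}(?:\S+\.)+{TLDS}(?:/[a-z0-9._/~%\-+&#?!=()@]*)?$",
        re.IGNORECASE,
    )


strict = build_pattern(scheme_required=True)
loose = build_pattern(scheme_required=False)
```

With the scheme required, "abc.com" is rejected; with the optional scheme it is accepted, while "http://localhost" fails in both cases because it has no dot-separated TLD.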

Boldewyn 2009-07-17 08:07:14

But I don't want the http:// at the beginning to be compulsory, as I want it to match "abc.com" also.

Alec Smart 2009-07-17 08:11:54

Seems like we commented/edited simultaneously. Fixed.

Boldewyn 2009-07-17 08:13:19

Can you please tighten [\S]* so that it allows only the characters permitted in a URL (no spaces, just letters, digits and whatever else is allowed)?

Alec Smart 2009-07-17 10:20:49

\S should never match spaces... I updated it to what Wikipedia (http://en.wikipedia.org/wiki/How_to_edit#Links_and_URLs) allows in its URLs. That looks reasonable.

Boldewyn 2009-07-17 10:40:03

Answer 7

A:

Not exactly what the OP asked for, but this is a much simpler regular expression that does not need to be updated each time IANA introduces a new TLD. I believe it is adequate for most simple needs:

^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$

There is no list of TLDs, localhost is not matched, there must be at least 2 subparts, and every subpart after the first must be at least 2 characters long (e.g. "a.a" will not match but "a.ab" will).
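
The behaviour described above can be checked with a short Python sketch. The pattern is copied unchanged from this answer; only the matches helper is an addition for the example.

```python
import re

# The pattern from the answer, unchanged.
SIMPLE_URL = re.compile(r"^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$")


def matches(candidate: str) -> bool:
    return SIMPLE_URL.match(candidate) is not None
```

Note that the pattern has no path part, so an input such as "example.com/path" is rejected as well.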

Diego Perini 2010-08-18 02:23:27
