I am using the following regex to match a URL:

$search  = "/([\S]+\.(MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|au|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BL|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|EH|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MF|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|R|H|TP|TR|TT|TV|TW|TZ|UA|UG|UK|UM|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|YU|ZA|ZM|ZW)([\S]*))/i";

But it's a bit screwed up because it also matches "abc.php", which I don't want, and something like abc...test. I want it to match abc.com, though, as well as www.abc.com and http://abc.com.

It just needs a slight tweak at the end, but I am not sure what. (There should be a slash after the domain name, which it is not checking for right now; it is only checking \S.)

Thank you for your time.

+5  A: 

You should try one of the well tested expressions on RegexLib.com rather than rolling your own unless your requirement is unusual. URI/Domain name matches are quite a common requirement.

Cerebrus
A: 

Using a single regexp to match a URL string makes the code incredibly unreadable. I'd suggest using parse_url to split the URL into its components (which is not a trivial task), and then checking each part with a regexp.
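
A minimal sketch of that approach, assuming PHP 7 or later. The helper name looksLikeHttpUrl and the three-entry TLD list are hypothetical and only for illustration; a real check would need the full list. Because parse_url treats a bare "abc.com" as a path rather than a host, the sketch prepends http:// when no scheme is present.

```php
<?php
// Hypothetical helper: split the candidate with parse_url() and check each part.
function looksLikeHttpUrl(string $candidate): bool
{
    // Illustrative, abridged TLD list (assumption); extend it as needed.
    $tlds = 'com|net|org';

    // parse_url('abc.com') yields only a path, so add a scheme if none is given.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }

    $parts = parse_url($candidate);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    // Only http and https are accepted.
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // Host: one or more dot-separated labels followed by a known TLD.
    return preg_match('/^(?:[a-z0-9-]+\.)+(?:' . $tlds . ')$/i', $parts['host']) === 1;
}
```

With this abridged list, looksLikeHttpUrl('abc.com') and looksLikeHttpUrl('http://abc.com/index.php?id=1') return true, while 'abc.php' and 'abc...test' are rejected at the host check.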

Bluehorn
A: 

@Bluehorn (I can't comment yet) That's the eternal battle with regexes.

I think they make the code unreadable only to those who don't understand them. All it needs is a comment so you can remember what it matches. That's all. If you write your own code to parse a URL, you'll end up with a lot of code that needs to be well commented. And depending on the programmer, the resulting code can be worse than a complex regex.

But that's my opinion and not what the OP wants to know.

clinisbut
I am not against using regexps for this. I am against writing it yourself instead of using standard features of e.g. PHP. Reinventing the wheel will only cause headaches... Anyway, I think I misunderstood the question, since he is not trying to match URLs but is looking for valid domain names in text.
Bluehorn
Then I agree with you that you shouldn't reinvent the wheel with regexes ;)
clinisbut
A: 

Changing the end of the regex to (/\S*)?)$ should solve your problem.

To explain what that is doing -

  • it is looking for / followed by some characters (not whitespace)
  • this match is optional; ? indicates 0 or 1 times
  • and finally it should be followed by the end of the string (or change it to \b for matching on a word boundary).
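
A small sketch comparing the original ending with the suggested one. The TLD list is abridged for illustration (an assumption, not the full list from the question); PH is kept so that the "abc.php" case can be reproduced. The variable names are hypothetical.

```php
<?php
// Abridged TLD list (assumption); the question's pattern lists many more.
$loose  = '/([\S]+\.(COM|NET|ORG|PH)([\S]*))/i';    // original ending
$strict = '/([\S]+\.(COM|NET|ORG|PH)(\/\S*)?)$/i';  // suggested ending

$results = [];
foreach (['abc.com', 'http://abc.com/index.html', 'abc.php'] as $subject) {
    $results[$subject] = [
        'loose'  => preg_match($loose, $subject),
        'strict' => preg_match($strict, $subject),
    ];
}
print_r($results);
```

Here "abc.php" matches the loose pattern (via PH) but not the strict one, because after the TLD only an optional /path and the end of the string are allowed. Note that [\S]+ also accepts dots, so a string such as "abc...com" still matches the strict pattern.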
benophobia
A: 

$ : The dollar signifies the end of the string.
For example, \d+$ will match strings which end with a digit. So you need to add the $!
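
A quick sketch of how the $ anchor interacts with the quantifier in front of it (variable names are hypothetical):

```php
<?php
$oneOrMore  = preg_match('/\d+$/', 'abc123'); // 1: digits sit right before the end
$noDigit    = preg_match('/\d+$/', 'abc');    // 0: no digit at the end
$zeroOrMore = preg_match('/\d*$/', 'abc');    // 1: \d* may match zero digits
```

The last line shows why the quantifier matters: with * the digit part can be empty, so the pattern succeeds on any string.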

Matthieu
+1  A: 
$search  = "#^((?#
    the scheme:
  )(?:https?://)(?#
    second level domains and beyond:
  )(?:[\S]+\.)+((?#
    top level domains:
  )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
  )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
  )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
  )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
  )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
  )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
  )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
  )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
  )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
  )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
  )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";

I just cleaned it up a bit. This will match only HTTP(S) addresses with the http:// or https:// scheme declared and, as long as you copied all top-level domains correctly from IANA, only standardized ones (it will not match http://localhost).

Finally, the pattern should end with the path part, which always starts with a / if it is present.

However, I'd suggest following Cerebrus's advice: if you're not sure about this, learn regexps in a more gentle way and use proven patterns for complicated tasks.

Cheers,

By the way: Your regexp will also match something.r and something.h (between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.

On re-reading the question: Change

  )(?:https?://)(?#

to

  )(?:https?://)?(?#

(note the extra ?) to match 'URLs' without the scheme.
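
A condensed sketch of the pattern with the optional scheme, for trying it out. The TLD alternation is abridged to three entries for illustration (an assumption), and the $stricter variant, which replaces [\S]+ with [a-z0-9-]+ so that empty labels such as "abc...com" are rejected, is a suggestion that is not part of the pattern above.

```php
<?php
// Abridged TLD list (assumption); substitute the full alternation in real use.
$tld  = '(COM|NET|ORG)';
// Path part: optional, always starting with a slash.
$path = '(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?';

// Scheme optional, labels matched with [\S]+.
$search   = '#^((?:https?://)?(?:[\S]+\.)+' . $tld . $path . ')$#i';
// Hypothetical variant: labels restricted to letters, digits and hyphens.
$stricter = '#^((?:https?://)?(?:[a-z0-9-]+\.)+' . $tld . $path . ')$#i';

$ok      = preg_match($search, 'abc.com');          // 1
$noTld   = preg_match($search, 'abc.php');          // 0 with this abridged list
$dots    = preg_match($search, 'abc...com');        // 1: [\S]+ also accepts dots
$dotsFix = preg_match($stricter, 'abc...com');      // 0
```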

Boldewyn
But I don't want the http:// at the beginning to be compulsory, as I want it to match "abc.com" also.
Alec Smart
Seems like we commented/edited simultaneously. Fixed.
Boldewyn
Can you please improve [\S]* so that it allows only what is permitted in a URL (no spaces, only word characters, digits, and so on)?
Alec Smart
\S should never match spaces... I updated it to what Wikipedia http://en.wikipedia.org/wiki/How_to_edit#Links_and_URLs allows in its URLs. That looks reasonable.
Boldewyn
A: 

Not exactly what the OP asked for but this is a much simpler regular expression that does not need to be updated each time the IANA introduces a new TLD. I believe this is more adequate for most simple needs:

^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$

There is no list of TLDs, localhost is not matched, there must be at least two subparts, and every subpart after the first must be at least two characters long (e.g. "a.a" will not match but "a.ab" will).
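
A short sketch for checking this pattern against a few inputs. The # delimiter and the variable names are choices made here for illustration, not part of the pattern itself.

```php
<?php
$pattern  = '#^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$#';
$subjects = ['abc.com', 'http://www.abc.com', 'a.ab', 'a.a', 'localhost', 'abc.php'];

$matches = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $matches[$subject] = preg_match($pattern, $subject);
}
print_r($matches);
```

Since there is no TLD list, "abc.php" is accepted here, and because the pattern has no path part, an input such as "http://abc.com/path" is rejected.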

Diego Perini