I have used the following regex to get the URLs from text (e.g. "this is text http://url.com/blabla possibly some more text").

'@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)@'

This has worked for all URLs so far, but I just found out it doesn't work for shortened URLs: "blabla bla http://ff.im/-bEnA blabla" becomes http://ff.im/ after the match.

I suspect it has to do with the dash - after the slash /.

Any help on how to update this regex would be amazing.

Ice

+3  A: 

Short answer: [\w/_\.] doesn't match a hyphen (-), so make it [-\w/_\.]
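
A quick way to check the short answer, sketched in Python. This is an assumption: the question doesn't say which language is used, and the '@' delimiters (which look PCRE-style) are dropped because Python's re module doesn't use them. The names ORIGINAL and FIXED are only for illustration.

```python
import re

# Pattern from the question, without the '@' delimiters.
ORIGINAL = re.compile(r"(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)")

# Same pattern with a hyphen added to the path character class.
FIXED = re.compile(r"(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)")

text = "blabla bla http://ff.im/-bEnA blabla"

print(ORIGINAL.search(text).group(1))  # http://ff.im/
print(FIXED.search(text).group(1))     # http://ff.im/-bEnA
```

Putting the hyphen first in the character class keeps it a literal hyphen rather than a range operator.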

Long answer:

@              - delimiter
(              - start of group
    https?://  - http:// or https://
    ([-\w\.]+)+ - capture 1 or more hyphens, word characters or dots, 1 or more times; the second + looks redundant, since the first + already allows repetition
    (:\d+)?    - optionally capture a : and some numbers (the port)
    (          - start of group
     /            - leading slash
     (            - start of group
      [\w/_\.]* - zero or more word characters, slashes, underscores or dots (the path + filename) - you need to add a hyphen to this list, or just make it [^?\s]* - any char except ? or whitespace
      (\?\S+)?  - optionally capture a ? followed by anything except whitespace (the querystring)
     )?     - close group, make it optional
    )?         - close group, make it optional
)              - close group
@
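
The more permissive alternative from the breakdown can be sketched like this, again in Python as an assumption about the language. The path class is written with a lowercase \s, so it means "anything except ? or whitespace". The redundant outer + on the host group is dropped here, and PERMISSIVE and find_urls are hypothetical names.

```python
import re

# Path part uses [^?\s]*: any run of characters other than '?' or whitespace.
PERMISSIVE = re.compile(r"(https?://([-\w.]+)(:\d+)?(/([^?\s]*(\?\S+)?)?)?)")


def find_urls(text):
    """Return the full match (group 1) for every URL found in text."""
    return [m.group(1) for m in PERMISSIVE.finditer(text)]


sample = "see http://ff.im/-bEnA and https://example.com:8080/a-b/c.html?x=1&y=2 now"
print(find_urls(sample))
```

One trade-off: because the path now accepts nearly everything, punctuation directly after a URL (such as a sentence-ending period) gets captured too, so some post-processing may be needed.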
Greg
Note that \w (letters, digits and underscore) leaves out valid characters. RFC 1738 says that "only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL", i.e. just about anything can appear in a URL.
Luca Tettamanti
A: 

An HTTP(S) URL can contain almost any character after the host and port section except for a space character, which has to be encoded (as %20, or as + in a query string).
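
To illustrate how a space ends up encoded, here is a small sketch using Python's standard urllib.parse (Python is an assumption; the thread doesn't name a language, and path_segment is just an example value).

```python
from urllib.parse import quote, quote_plus

path_segment = "my file.txt"

# quote() percent-encodes the space, the form used in URL paths.
encoded_path = quote(path_segment)

# quote_plus() writes the space as '+', the form used in query strings
# (application/x-www-form-urlencoded).
encoded_query = quote_plus(path_segment)

print(encoded_path)   # my%20file.txt
print(encoded_query)  # my+file.txt
```

Either way, a raw space never appears in the URL itself, so stopping the match at whitespace is a reasonable boundary.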

Alnitak
A: 

Jeff Atwood blogged about this a while ago: http://www.codinghorror.com/blog/archives/001181.html

Luca Tettamanti