Regarding this suggestion: http://stackoverflow.uservoice.com/pages/general/suggestions/103227-parser-does-not-match-all-valid-urls Is this regex adequate, or does it need to be refined? If so, how?

\b(?P<link>(?:.*?://)[\w\-\_\.\@\:\/\?\#\=]*)\b
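One way to judge whether the pattern is adequate is to run it against a few inputs. The sketch below does that in Python, whose `re` module accepts the `(?P<link>...)` named-group syntax used above. The sample strings and the `extract_link` helper are invented for illustration and are not part of the original question.

```python
import re

# The pattern from the question, copied verbatim.
LINK_RE = re.compile(r'\b(?P<link>(?:.*?://)[\w\-\_\.\@\:\/\?\#\=]*)\b')

def extract_link(text):
    """Return the text captured by the 'link' group, or None if nothing matches."""
    m = LINK_RE.search(text)
    return m.group('link') if m else None

print(extract_link("http://example.com/a?b=1"))
print(extract_link("see http://example.com/a?b=1 now"))
print(extract_link("http://example.com/a?b=1&c=2"))
```

In this sketch the second call returns `see http://example.com/a?b=1`, because `.*?://` is free to start at any earlier word boundary, and the third call stops before `&`, because that character is not in the bracketed set. Restricting what may precede `://` and widening the set of allowed characters are two possible refinements.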
+7  A: 

Even though the question is vague, I'll attempt to respond with possible solutions.

Possible Intention 1: To match any URLs in a given file (for replacement):

/^([^:]+):\/\/([-\w._]+)(\/[-\w._]\?(.+)?)?$/ig

The above should match nearly all URL formats, with the following captured groups:

0 => entire match
1 => protocol (e.g. http, ftp, git, ...)
2 => hostname (e.g. www.stackoverflow.com)
3 => requested_file_path (e.g. /images/prod/1/4/success.gif)
4 => query_string (e.g. param=1&param2=2&param3=3)
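To see how those groups come out in practice, here is a minimal Python sketch. It transcribes the pattern above unchanged and assumes `re.IGNORECASE` and `re.MULTILINE` as stand-ins for the `i` and `g` flags, so that `^` and `$` anchor each line of a file. The sample text and the `find_urls` helper are made up for illustration.

```python
import re

# The answer's pattern, transcribed as-is (Python needs no escaping for "/").
URL_RE = re.compile(r'^([^:]+)://([-\w._]+)(/[-\w._]\?(.+)?)?$',
                    re.IGNORECASE | re.MULTILINE)

def find_urls(text):
    """Return one tuple of captured groups (1 to 4) per matching line."""
    return [m.groups() for m in URL_RE.finditer(text)]

sample = (
    "ftp://www.stackoverflow.com\n"
    "http://www.stackoverflow.com/a?param=1&param2=2\n"
    "not a url"
)
for groups in find_urls(sample):
    print(groups)
```

With this input the sketch yields two tuples. Groups 3 and 4 are `None` for the first line. For the second line, group 3 comes back as `/a?param=1&param2=2`, so as written it carries the query string along with the path.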

Possible Intention 2: To get details about the current request URL:

To get details about the URL such as the protocol, hostname, requested file path, and query string, you're better off using what the language already provides rather than a regex. In PHP, all of the above is available from the $_SERVER superglobal, with dirname() and basename() to split the script path:

$protocol = $_SERVER['SERVER_PROTOCOL']; // e.g. HTTP/1.0
$host = $_SERVER['HTTP_HOST']; // e.g. www.stackoverflow.com
$path_to_file = dirname($_SERVER['SCRIPT_NAME']); // directory part of the script path
$file = basename($_SERVER['SCRIPT_NAME']); // file name part of the script path
$query_string = $_SERVER['QUERY_STRING']; // e.g. param=1&param2=2&param3=3
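The same "let the language do it" idea applies when the URL is just a string rather than the current request. As a rough sketch in Python (not PHP, and not part of the original answer), the standard library's `urlsplit` returns roughly the same pieces. The `url_details` helper and the sample URL are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def url_details(url):
    """Split a URL string into pieces comparable to those in the PHP snippet."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,                        # URL scheme, e.g. "http"
        "host": parts.netloc,                            # host (plus port, if any)
        "path_to_file": posixpath.dirname(parts.path),   # directory part of the path
        "file": posixpath.basename(parts.path),          # last path segment
        "query_string": parts.query,                     # text after "?"
    }

details = url_details(
    "http://www.stackoverflow.com/images/prod/1/4/success.gif?param=1&param2=2"
)
print(details)
```

Note that `protocol` here is the URL scheme, whereas `$_SERVER['SERVER_PROTOCOL']` reports the HTTP version of the request.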

Hope this helps in any way.

localshred
Wish I could add this as a favorite answer
tj111
@tj111 Thanks! I'm glad I could help. You could favorite the question, so you always have a quick way to get back.
localshred
A: 

I guess SO blocks comments after a while? localshred's answer is great, except for a missing wildcard and unescaped periods:

    /^([^:]+):\/\/([-\w\._]+)(\/[-\w\._]*\?(.+)?)?$/ig
                                        ^-- wildcard
                        ^
    we dont want to match everything ^
mitjak
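As a quick check of the revised pattern, here is a small Python sketch. `REVISED` transcribes the pattern above. `LOOSER` is a hypothetical variant, not from either answer, that assumes the path may contain further `/` characters and that the query string is optional. The sample URLs and the `groups` helper are made up.

```python
import re

# Pattern from the answer above (Python needs no escaping for "/").
REVISED = re.compile(r'^([^:]+)://([-\w\._]+)(/[-\w\._]*\?(.+)?)?$', re.IGNORECASE)

# Hypothetical looser variant: path and query captured separately, both optional.
LOOSER = re.compile(r'^([^:]+)://([-\w.]+)(/[^?#\s]*)?(?:\?([^#\s]*))?$', re.IGNORECASE)

def groups(pattern, url):
    """Return the captured groups as a tuple, or None when the URL does not match."""
    m = pattern.match(url)
    return m.groups() if m else None

with_query = "http://www.stackoverflow.com/index.php?param=1&param2=2"
deep_path = "http://www.stackoverflow.com/images/prod/1/4/success.gif"

print(groups(REVISED, with_query))
print(groups(REVISED, deep_path))
print(groups(LOOSER, with_query))
print(groups(LOOSER, deep_path))
```

In this sketch `REVISED` matches the URL with a single path segment and a query string, but returns `None` for the multi-segment path without a query, since the pattern requires a `?` after the path and does not allow `/` inside it. The looser variant accepts both, though it still does not handle ports, credentials, or fragments.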
@mitjak: You don't need to escape periods within a character class, e.g. in the pattern `/.[.]/` the first dot means "any character" while the second dot (within the [brackets]) means a literal period.
Stephen P
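The difference is easy to check. The following sketch uses Python's `re` module only, so it says nothing definitive about other engines. The test strings and the `results` list are arbitrary.

```python
import re

# Outside brackets "." matches any character (except a newline by default);
# inside brackets it is a literal period, escaped or not.
dot_then_period = re.compile(r'.[.]')

results = [
    bool(dot_then_period.fullmatch("a.")),                        # True
    bool(dot_then_period.fullmatch("ab")),                        # False: "b" is not a period
    bool(re.fullmatch(r'[-\w._]+', "www.stackoverflow.com")),     # True, unescaped dot
    bool(re.fullmatch(r'[-\w\._]+', "www.stackoverflow.com")),    # True, escaped dot
    bool(re.fullmatch(r'[-\w._]+', "www,stackoverflow")),         # False: "," is not in the set
]
print(results)
```

Both bracketed forms accept and reject the same strings here, so the extra backslashes are harmless but not required in this engine.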
Does that apply to Java as well? At work I seem to recall getting different results when the period was not escaped.
mitjak