Hi All,

Here's the regular expression I use; I parse it with CAtlRegExp (from ATL):

(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])

It works fine except for one flaw: when the URL is preceded by other characters, the string is still accepted as a URL.

Example inputs:

  • this is a link www.google.com (where I can just tokenize the spaces and validate each word)

  • is...www.google.com (this string still matches the RegEx above :( )

Please help... Thanks...

+1  A: 

What about this one: (((f|ht)tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+) ?

thelost
`{1}` is just noise: it can be left out.
Bart Kiers
With `\+`, are you escaping the `+`, or trying to include the literal backslash? If it's the first, the `+` needs no escaping inside a character class and if it's the latter, it needs an extra backslash.
Bart Kiers
+2  A: 

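You need to tell the regex to match only at the start and end of the string. I'm not sure how you do that in VC++, but in most regex flavors you enclose the pattern in ^ and $: ^ means "the start of the string" and $ means "the end of the string." Once the pattern is anchored, the last character class also needs a + quantifier; otherwise the final label could only be one character long and www.google.com would no longer match.

^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+)$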

The second input matches because, without anchors, the regex only has to find a valid URL somewhere inside the string.
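To make the difference concrete, here is a minimal sketch using std::regex (ECMAScript syntax) as a stand-in, since CAtlRegExp has its own syntax that I have not tested against. The function names containsUrl and isWholeUrl are made up for this illustration, the scheme part is simplified to lowercase https?, and the trailing + on the last character class is an adjustment so the anchored form can match a multi-character final label.

```cpp
#include <regex>
#include <string>

// Simplified version of the question's pattern for std::regex.
// Assumption: lowercase scheme only, and a '+' added to the last class.
const char* const kUrlBody =
    R"((https?://)?([a-zA-Z0-9]+[.]+[a-zA-Z0-9]+[.]+[a-zA-Z0-9]+))";

// Unanchored: true if a URL-like substring occurs anywhere in the text.
bool containsUrl(const std::string& text) {
    static const std::regex re(kUrlBody);
    return std::regex_search(text, re);
}

// Anchored with ^ and $: true only if the whole string is URL-like.
bool isWholeUrl(const std::string& text) {
    static const std::regex re(std::string("^") + kUrlBody + "$");
    return std::regex_search(text, re);
}
```

With this sketch, containsUrl("is...www.google.com") is true while isWholeUrl("is...www.google.com") is false; both accept www.google.com on its own. The pattern is still far from a complete URL grammar.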

Andy Shellam
+1  A: 

Start the regex with ^ and end it with $ to have it match only if the entire string matches (if that's what you want). Note the + added to the last character class, so the final label can be longer than one character:

^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+)$
Michael Burr
+3  A: 
  1. Use the IgnoreCase flag instead of catering for each case.
  2. Stick a ^ at the beginning if you want the start of the string to be the start of the URL
  3. You're missing a lot of characters from possible, valid URLs.
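As a rough illustration of points 1 and 2, here is a sketch with std::regex rather than CAtlRegExp (an assumption on my part; the helper name hasHttpScheme is hypothetical). It is also worth noting that (h|H?) in the original pattern means "h, or an optional H", so each letter of the scheme can be left out entirely and something like tp:// is accepted.

```cpp
#include <regex>
#include <string>

// True when the text begins with http:// or https:// in any letter case.
// std::regex::icase replaces the per-letter (h|H)(t|T)... alternations,
// and the leading ^ ties the scheme to the start of the string.
bool hasHttpScheme(const std::string& text) {
    static const std::regex scheme("^https?://", std::regex::icase);
    return std::regex_search(text, scheme);
}
```

Here hasHttpScheme("HTTP://www.google.com") and hasHttpScheme("hTtPs://www.google.com") are both true, while "www.google.com" and "tp://www.google.com" are rejected.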
Oli
+1  A: 

This Regular Expression has been tested to work for the following

http|https://host[:port]/[?][parameter=value]*

public static final String URL_PATTERN = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";

PS. It also validates on localhost link.

(Thoroughly written by me :-))

The Elite Gentleman
+1  A: 

How about using CUrl (that is, 'C-Url', in ATL, not curl as in libcurl), which can parse URLs with CUrl::CrackUrl? If that function returns FALSE, you assume it's not a valid URL.

That said, decomposing a URL is complex enough to warrant a proper parser rather than a regex-based decomposition. See RFC 2396 (since obsoleted by RFC 3986) for an overview of the complexities.

Roel