I am using a regular expression to convert plain-text URLs to clickable links.

@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)@

However, sometimes in the text, URLs are listed one per line with a semicolon at the end. The real URLs do not contain any ";".

http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124

Is it legal to have a ; in a URL, or should I treat ; as an end-of-URL marker? How would that fit into my regular expression?

+2  A: 

http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.

EricLaw -MSFT-
+14  A: 

A semicolon is reserved and may not be used unencoded except for its special purpose (which depends on the scheme). From Section 2.2 of RFC 1738:

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.

Greg
"may not be used unencoded": ... for a purpose other than its special meaning. The correct answer to the question is "Yes, it is legal to have a semicolon in a URL", but the impression I get from this answer (not the spec quote, but the summary) is "No, an unencoded semicolon may not be used in a URL."
Miles
@Miles edited to clarify
Greg
R. Bemrose
+2  A: 

The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt

However, the specification states that whether the semicolon is legitimate for a specific URI depends on the scheme or the producer of that URI. So, if the site using those links doesn't allow semicolons, then they're not valid for that particular case.
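
To make the character categories concrete, here is a minimal Python sketch. The character sets are the gen-delims, sub-delims, and unreserved punctuation listed in RFC 3986; the `classify` helper and its labels are made up for illustration. It also shows that the standard library's `urlsplit` leaves a trailing semicolon in the query untouched, so a generic parser will not decide for you whether that semicolon belongs to the URL.

```python
from urllib.parse import urlsplit

# Character sets as listed in RFC 3986, section 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED_PUNCT = set("-._~")


def classify(ch):
    """Return a rough RFC 3986 category for a single character (hypothetical helper)."""
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if (ch.isascii() and ch.isalnum()) or ch in UNRESERVED_PUNCT:
        return "unreserved"
    return "other"  # e.g. spaces or non-ASCII, which need percent-encoding


print(classify(";"))
# A generic parser keeps the semicolon as part of the query string.
print(urlsplit("http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;").query)
```

Whether the site behind the URL treats that trailing semicolon as meaningful is still up to the site, which is the point made above.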

+4  A: 

The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as &amp;amp; in HTML whereas ; doesn't.
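
As a rough illustration of what "treat them the same way" could look like on the server side, here is a small Python sketch. The `parse_query` function is a hypothetical helper written for this example, not part of any CGI library; it simply splits on either separator before decoding each field.

```python
import re
from urllib.parse import unquote_plus


def parse_query(query):
    """Parse a query string, accepting both '&' and ';' as field separators."""
    pairs = {}
    for field in re.split(r"[&;]", query):
        if not field:
            continue  # skip empty fields such as a trailing separator
        name, _, value = field.partition("=")
        pairs.setdefault(unquote_plus(name), []).append(unquote_plus(value))
    return pairs


print(parse_query("name=fred&age=50"))
print(parse_query("name=fred;age=50"))
```

Note that a parser like this silently ignores a trailing ; such as the one in the question, which may or may not be what you want.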

+1  A: 

Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.

I replaced the Regex we were using with the following pattern (and have tested that it works):

 string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[&#95;.a-zA-Z0-9-]+\.[a-zA-Z0-9\/&#95;:@=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)

daviddlyman
I added code formatting so we could read it more easily, but I don't recommend using that regex. Leaving aside the obvious web mangling and the many redundant backslashes and pipes, the final two character classes are seriously flawed. Not only do they exclude valid characters like semicolons and parentheses, that last one matches all kinds of *invalid* characters like quotation marks, braces, and non-ASCII characters.
Alan Moore
+2  A: 

Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc.

If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:

https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])

After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.
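
Here is a minimal sketch of that regex in Python, whose `re` module supports this kind of single-character lookbehind. The `find_urls` and `linkify` names are made up for the example, and `re.ASCII` is my own choice to keep `\w` from matching non-ASCII letters; drop it if you want the flavor's default behavior. Only the URL itself is HTML-escaped here, so treat it as a starting point rather than a complete linkifier.

```python
import re
from html import escape

# The pattern from above: match URL-ish characters greedily, then back off
# until the last character is not likely sentence punctuation.
URL_RE = re.compile(
    r"""https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])""",
    re.ASCII,
)


def find_urls(text):
    """Return every URL the pattern finds in text."""
    return URL_RE.findall(text)


def linkify(text):
    """Wrap each matched URL in an anchor tag, leaving other text as-is."""
    def repl(match):
        url = match.group(0)
        return '<a href="{0}">{1}</a>'.format(escape(url, quote=True), escape(url))
    return URL_RE.sub(repl, text)


sample = (
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124"
)
print(find_urls(sample))
print(linkify("x http://a.org/p?id=1; y"))
```

A semicolon in the middle of a URL, as in http://example.com/a;b=1/c, is kept; only trailing punctuation is dropped.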

Alan Moore
A: 

Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.

And some do use it for legitimate purposes, though its use is likely site-specific (i.e., only for use with that site) because its usage has to be defined by the site using it.

In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.

For example, sending someone an email with this link:

http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/kiddie_porn_movie/

will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because even though it is legitimate (i.e., properly formed), no such page exists. But the second link (0200.0xfe.0x37.0xbf/kiddie_porn_movie/) presumably exists* and the user will be directed to the kiddie_porn_movie page, whereupon one's corporate IT manager will get a report and one will likely get a pink slip.

And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.

*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.

No Spam