How would one go about spotting URIs in a block of text?

The idea is to turn such runs of text into links. This is pretty simple if one considers only the http(s) and ftp(s) schemes; however, I am guessing the general problem (considering tel, mailto and other URI schemes) is much more complicated (if it is even possible).

I would prefer a solution in C# if possible. Thank you.

A: 

For a lot of the protocols you could just search for "://" without the quotes. Not sure about the others though.

mdec
A: 

A list of official IANA-registered URI schemes can be found on Wikipedia. I agree that searching the web for a premade regular expression might be the best idea.

Aleksi
A: 

You may want to consider regular expressions: http://www.perlmonks.org/?node_id=533485

Rick Kierner
+1  A: 

Whether or not something is a URI is context-dependent. In general the only thing they always have in common is that they start with "scheme_name:". The scheme name can be anything (subject to legal characters). But other strings also contain colons without being URIs.

So you need to decide which schemes you're interested in. Generally you can get away with searching for "scheme_name:", followed by characters up to a space, for each scheme you care about. Unfortunately, URIs as people actually write them can contain spaces (strictly, a space should be percent-encoded as %20), so if they're embedded in text they are potentially ambiguous. There's nothing you can do to resolve the ambiguity; the person who wrote the text would have to fix it. URIs can optionally be enclosed in <>. Most people don't do that, though, so recognising that format will only occasionally help.

The Wikipedia article for URI lists the relevant RFCs.

[Edit to add: using regular expressions to fully validate URIs is a nightmare. Even if you somehow find or create one that's correct, it will be very large and difficult to comment and maintain. Fortunately, if all you're doing is highlighting links, you probably don't care about the odd false positive, so you don't need to validate. Just look for "http://", "mailto:\S*@", etc.]
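As a rough illustration of the "pick your schemes, then scan up to the next space" idea, here is a minimal JavaScript sketch. The scheme list, the function name `findUris` and the decision to also stop at `<` and `>` (so that URIs wrapped in angle brackets come out clean) are my assumptions, not anything from a standard.

```javascript
// Hypothetical whitelist: extend it with whatever schemes you care about.
const SCHEMES = ["http", "https", "ftp", "mailto", "tel"];

// Returns every run of text that starts with "scheme:" for a whitelisted
// scheme and continues up to the next whitespace character, "<" or ">".
function findUris(text, schemes = SCHEMES) {
  const pattern = new RegExp(`\\b(?:${schemes.join("|")}):[^\\s<>]+`, "g");
  return text.match(pattern) || [];
}
```

No validation is attempted here, so the odd false positive (or a URI cut short at an unescaped space) is to be expected.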

Steve Jessop
A: 

You could take a look at the source code of Regexp::Common::URI.

cubex
+1  A: 

Here is a code snippet with regular expressions for various needs:

http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/

Drejc
A: 

That is not easy to do if you also want to match "something.tld", because normal text will have many instances of that pattern. If you only want to match URIs that begin with a scheme, you can try this regular expression (sorry, I don't know how to plug it into C#):

(http|https|ftp|mailto|tel):\S+[/a-zA-Z0-9]

You can add more schemes there. It will match from the scheme up to the next whitespace character, while making sure the last character matched is a slash, letter or digit (so the trailing full stop in the very common string "http://www.example.com." is left out).
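Since the goal is to turn the matches into links, here is a small JavaScript sketch that uses this pattern with `String.prototype.replace`. The leading `\b` and the function name `linkify` are my additions, and the sketch assumes the input text has already been HTML-escaped.

```javascript
// Same idea as the pattern above, with a word boundary added in front
// so that the scheme is not matched in the middle of a longer word.
const URI_PATTERN = /\b(?:https?|ftp|mailto|tel):\S+[\/a-zA-Z0-9]/g;

// Wraps every match in an anchor tag; everything else is left untouched.
// Assumption: `text` is already safe to embed in HTML.
function linkify(text) {
  return text.replace(URI_PATTERN, (uri) => `<a href="${uri}">${uri}</a>`);
}
```

For example, `linkify("See http://www.example.com.")` keeps the final full stop outside the link.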

Victor
A: 

The following Perl regexp should do the trick. Does C# have Perl regexps?

/\w+:\/\/[\w][\w.\/]*/
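To get a feel for what this pattern does and does not pick up, it can be tried as-is in JavaScript, whose regex syntax accepts it unchanged. The function name `extractUrls` is just for illustration.

```javascript
// The pattern from above, with the global flag so that all matches are returned.
const SIMPLE_URL = /\w+:\/\/[\w][\w.\/]*/g;

function extractUrls(text) {
  return text.match(SIMPLE_URL) || [];
}
```

Because the character class only allows word characters, dots and slashes, a match stops at the first `?`, `-` or `@`, and anything without `://` (such as a mailto: address) is not found at all.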

J.D. Fitz.Gerald
A: 

Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.

To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):

\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*

This would match http://example.com/foo/bar-baz

and ftp://192.168.0.1/foo/file.txt

but would cause problems for at least these:

  • mailto:[email protected] (no match - no //, but present @)
  • ftp://192.168.0.1.2 (match, but too many numbers, so it's not a valid URI)
  • ftp://1000.120.0.1 (match, but the IP address needs numbers between 0 and 255, so it's not a valid URI)
  • nonexistantscheme://obvious.false.positive
  • http://www.google.com/search?q=uri+regular+expression (match, but query isn't …)

I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression if you can't write one yourself.

If you're looking at text pulled from fairly controlled sources (e.g. machine generated), then this will be the best course of action.

If you absolutely, positively have to catch every URI that you encounter, and you're looking at text from the wild, then I think I would look for any word with a colon in it, e.g. \s(\w+:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser in the URI class of whatever library you're using.
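A minimal sketch of this two-step approach in JavaScript might look like the following. It uses the built-in WHATWG `URL` class as the "real parser"; the function name `findValidUris` and the exact candidate pattern are my assumptions.

```javascript
// Step 1: collect candidates that look like "scheme:something".
// Step 2: keep only those that the URL parser accepts.
function findValidUris(text) {
  // A candidate starts at the beginning of the text or right after whitespace,
  // has a scheme-like word (letter first), a colon, then non-space characters.
  const candidates = text.match(/(?<!\S)[a-zA-Z][a-zA-Z0-9+.-]*:\S+/g) || [];
  return candidates.filter((candidate) => {
    try {
      new URL(candidate); // throws a TypeError if the string cannot be parsed
      return true;
    } catch {
      return false;
    }
  });
}
```

Note that the `URL` class is fairly lenient with schemes it does not know, so something like "note:this" would still get through; a scheme whitelist on top may be worth adding.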

If you're interested in why it's so hard to write a URI pattern, then I guess it would be that the definition of a URI is done with a Type-2 grammar, while regular expressions can only parse languages from Type-3 grammars.

jamesh
A: 

The URL Tool for Ubiquity does the following:

findURLs: function(text) {
    var urls = [];
    var matches = text.match(/(\S+\.{1}[^\s\,\.\!]+)/g);
    if (matches) {
        for each (var match in matches) {
            urls.push(match);
        }
    }
    return urls;
},

Sam Hasler