How do I find URLs (e.g. www.domain.com) within a document and wrap them in anchors, like <a href="www.domain.com">www.domain.com</a>?

html:

Hey dude, check out this link www.google.com and www.yahoo.com!

javascript:

(function(){var text = document.body.innerHTML;/*do replace regex => text*/})();

output:

Hey dude, check out this link <a href="www.google.com">www.google.com</a> and <a href="www.yahoo.com">www.yahoo.com</a>!
+1  A: 

I've never used it, but this looks like a decent bit of code to leverage:

http://github.com/cowboy/javascript-linkify

timdev
+3  A: 

Firstly, www.domain.com isn't a URL, it's a hostname, and

<a href="www.domain.com">

won't work: the browser treats it as a relative URL, so it'll look for a .com file called www.domain relative to the current page.

It's not possible to highlight hostnames in the general case because almost anything can be a hostname. You could try to highlight ‘www.something.dot.separated.words’, but it's not really that reliable and there are many sites that don't use the www. hostname prefix. I'd try to avoid that.

/\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/;

This is a very liberal pattern you could use as a starting point for detecting HTTP URLs. Depending on what sort of input you've got, you may want to narrow down what it allows, and it may be worth detecting trailing characters like . or ! that would be valid parts of the URL but in practice generally aren't.
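To illustrate the trailing-character point, here is a minimal sketch that runs the pattern over plain text and then strips punctuation from the end of each match. The helper name `findUrls` and the exact set of stripped characters are assumptions for illustration, not part of the original answer.

```javascript
// Same liberal pattern as above, with the global flag so exec() can iterate.
var urlPattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: list the URLs found in a plain-text string.
function findUrls(text) {
    var urls = [];
    var match;
    urlPattern.lastIndex = 0; // start from the beginning on every call
    while ((match = urlPattern.exec(text)) !== null) {
        // Assumption: these characters at the very end of a match are
        // sentence punctuation rather than part of the URL.
        urls.push(match[0].replace(/[.,!?;:]+$/, ''));
    }
    return urls;
}
```

The stripped set is a judgment call; you might add or remove characters depending on the text you expect.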

(You could use a | to allow either the URL syntax or the www.hostname syntax, if you like.)
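As a rough sketch of that alternation: the pattern below accepts either an `http://`/`https://` prefix or a bare `www.` prefix. The names `linkPattern` and `toHref` are made up for this example, and defaulting to `http://` for bare hostnames is an assumption, since the text alone doesn't say which scheme the site uses.

```javascript
// Assumed combined pattern: a full http(s) URL, or a hostname starting with "www.".
var linkPattern = /\b(?:https?:\/\/|www\.)[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: a bare hostname needs a scheme before it can be
// used as an href, otherwise it is resolved as a relative URL.
function toHref(matchedText) {
    return /^https?:\/\//i.test(matchedText) ? matchedText : 'http://' + matchedText;
}
```

Note that this pattern still picks up trailing punctuation, so `www.yahoo.com!` in the question's sample text would match with the `!` included.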

Anyhow, once you've settled on your preferred pattern you'll need to find that pattern in text nodes on the page. Don't run the regexp over innerHTML markup. You'll end up completely ruining the page by trying to mark up every href="http://something" that's already inside markup. You'll also destroy any existing JavaScript references, events or form field values when you replace the innerHTML content.
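A small string-only demonstration of the problem, using a made-up snippet of markup: the URL inside the existing `href` attribute gets wrapped too, and the result is no longer valid HTML.

```javascript
var urlPattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical markup that already contains a link.
var markup = '<a href="http://example.com/">home</a>';

// Naive approach: wrap every match found in the raw markup.
var broken = markup.replace(urlPattern, function (url) {
    return '<a href="' + url + '">' + url + '</a>';
});

// broken is now:
// <a href="<a href="http://example.com/">http://example.com/</a>">home</a>
```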

In general regexp simply cannot process HTML in any reliable way. So take advantage of the fact that the browser has already parsed the HTML into elements and text nodes, and just look at the text nodes. You'll also want to avoid looking inside <a> elements, since marking up a URL as a link when it's already in a link is silly (and invalid).

// Mark up `http://...` text in an element and its descendants as links.
//
function addLinks(element) {
    var urlpattern= /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;
    findTextExceptInLinks(element, urlpattern, function(node, match) {
        node.splitText(match.index+match[0].length);
        var a= document.createElement('a');
        a.href= match[0];
        a.appendChild(node.splitText(match.index));
        node.parentNode.insertBefore(a, node.nextSibling);
    });
}

// Find text in descendants of an element, in reverse document order.
// pattern must be a regexp with the global flag; without it, the exec loop
// below would keep returning the same match and never finish.
//
function findTextExceptInLinks(element, pattern, callback) {
    for (var childi= element.childNodes.length; childi-->0;) {
        var child= element.childNodes[childi];
        if (child.nodeType===1) {
            if (child.tagName.toLowerCase()!=='a')
                findTextExceptInLinks(child, pattern, callback);
        } else if (child.nodeType===3) {
            var matches= [];
            var match;
            while (match= pattern.exec(child.data))
                matches.push(match);
            for (var i= matches.length; i-->0;)
                callback.call(window, child, matches[i]);
        }
    }
}
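The reason for walking the matches backwards may be easier to see without the DOM. Below is a string-only model of what the `splitText` calls do to a single text node; `splitAroundMatches` is a hypothetical name and the simplified pattern in the comment is only for illustration. Because each step cuts pieces off the end, the `index` values of the earlier matches still point at the right place in what remains.

```javascript
// Hypothetical model: split a string into link and non-link pieces,
// processing matches from last to first as the DOM code above does.
function splitAroundMatches(text, pattern) {
    var matches = [];
    var match;
    pattern.lastIndex = 0;
    while ((match = pattern.exec(text)) !== null)
        matches.push(match);

    var pieces = [];
    var head = text; // plays the role of the original text node
    for (var i = matches.length; i-- > 0;) {
        var m = matches[i];
        var end = m.index + m[0].length;
        pieces.unshift({ link: false, text: head.slice(end) }); // tail after the match
        pieces.unshift({ link: true, text: m[0] });             // text that goes inside <a>
        head = head.slice(0, m.index);                          // earlier indexes stay valid
    }
    pieces.unshift({ link: false, text: head });
    return pieces;
}

// Example call, with a deliberately simplified pattern:
// splitAroundMatches('a http://x.com b http://y.com c', /\bhttps?:\/\/\S+/g)
```

In a browser, the real functions would be run with something like `addLinks(document.body)` once the document has loaded.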
bobince