I know this has been discussed here before, but no solution was offered for this exact problem. Please take a look...

I'm using a function to transform plain-text URLs into clickable links. This is what I have:

<script type='text/javascript' language='javascript'>

window.onload = autolink;

function autolink(text) {

var exp = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;

document.body.innerHTML = document.body.innerHTML.replace(exp,"<a href='$1'>$1</a>"); 

}

</script>

This turns plain text such as

http://stackoverflow.com/

into a clickable link:

<a href='http://stackoverflow.com/'>http://stackoverflow.com/</a>

It works, but it also replaces existing HTML links with nested links.

So, a valid HTML link like

<a href="http://stackoverflow.com/"&gt;StackOverflow&lt;/a&gt;

Becomes something messy like:

<a href="http://stackoverflow.com/&lt;a href="http://stackoverflow.com/"&gt;StackOverflow&lt;/a&gt;"&gt;StackOverflow&lt;/a&gt;...

How can I fix the expression to ignore the content of link tags? Thanks!

I'm a newbie... I barely understand the regex code. Please be gentle :) Thanks again.

+3  A: 

This problem is beyond the power of regular expressions. You might be able to write a regex that could avoid some links, but you wouldn't be able to avoid every existing link.

The good news is that a different approach will make the job much easier. Right now you are using document.body.innerHTML to manipulate the HTML as plain text. To do it correctly that way, you would basically need to parse the HTML yourself. But you don't have to, because the browser has already parsed it for you!

The web browser lets you access an HTML document as a tree of objects. It's called the Document Object Model (DOM), and if you do some reading on it, you should be able to learn how to traverse the HTML, skipping over anything inside an A element and applying the regex you have to plain text only.
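
To make that idea concrete, here is a rough sketch of such a traversal in plain JavaScript. It is only one way to do it, not a tested drop-in: the names splitUrls, autolinkNode and the SKIP list are made up for this example, and the regex is the one from the question with the "m" flag left out.

```javascript
// Same pattern as in the question ("m" is dropped because there are no ^ or $ anchors).
var URL_EXP = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi;

// Elements whose text is left alone (an assumed list; adjust as needed).
var SKIP = { A: true, SCRIPT: true, STYLE: true, TEXTAREA: true };

// Split a string into pieces and mark which pieces are URLs.
function splitUrls(text) {
    var pieces = [], last = 0, match;
    URL_EXP.lastIndex = 0; // the regex is global, so reset it before each scan
    while ((match = URL_EXP.exec(text)) !== null) {
        if (match.index > last) {
            pieces.push({ text: text.slice(last, match.index), isUrl: false });
        }
        pieces.push({ text: match[0], isUrl: true });
        last = match.index + match[0].length;
    }
    if (last < text.length) {
        pieces.push({ text: text.slice(last), isUrl: false });
    }
    return pieces;
}

// Walk the DOM below "node" and link URLs found in text nodes only.
function autolinkNode(node) {
    if (node.nodeType === 3) { // 3 = text node
        var pieces = splitUrls(node.nodeValue);
        var hasUrl = pieces.some(function (p) { return p.isUrl; });
        if (!hasUrl) { return; }
        var frag = document.createDocumentFragment();
        pieces.forEach(function (p) {
            if (p.isUrl) {
                var a = document.createElement('a');
                a.href = p.text;
                a.appendChild(document.createTextNode(p.text));
                frag.appendChild(a);
            } else {
                frag.appendChild(document.createTextNode(p.text));
            }
        });
        node.parentNode.replaceChild(frag, node);
    } else if (node.nodeType === 1 && !SKIP[node.nodeName.toUpperCase()]) { // 1 = element
        // Copy the child list first, because replacing nodes changes it.
        var children = Array.prototype.slice.call(node.childNodes);
        children.forEach(autolinkNode);
    }
}

// Only hook up the page when running in a browser.
if (typeof window !== 'undefined') {
    window.onload = function () { autolinkNode(document.body); };
}
```

Because the walker never descends into an A element, the text of an existing link is never seen by the regex, and the href attribute is not text at all, so it is never touched either.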

benzado
Thanks! I'll try.
Matias
+1  A: 

Using the jQuery JavaScript library, this would look like (demo at http://jsfiddle.net/BRPRH/4):

function autolink() {
    var exp = /(\b(https?|ftp):\/\/[-A-Z0-9+\u0026@#\/%?=~_|!:,.;]*[-A-Z0-9+\u0026@#\/%=~_|])/gi,
        lt = '\u003c',
        gt = '\u003e';

    $('*:not(a, script, style, textarea)').contents().each(function() {
        if (this.nodeType == Node.TEXT_NODE) {
            var textNode = $(this);
            var span = $(lt + 'span/' + gt).text(this.nodeValue);
            span.html(span.html().replace(exp, lt + 'a href=\'$1\'' + gt + '$1' + lt + '/a' + gt));
            textNode.replaceWith(span);
        }
    });
}

$(autolink);

Edit: Excluded textareas, scripts, and embedded CSS. I note that this can also be done using pure DOM's splitText, which has the advantage of not adding extra span elements.

Edit 2: Eliminated all ampersands and double quotes.

Edit 3: Got rid of < and > characters as well.
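
For reference, the splitText variant mentioned in the first edit could look roughly like the sketch below. This is an untested outline rather than the code behind the demo: firstUrl and linkTextNode are names invented here, and the pattern is a non-global copy of the one above written with literal ampersands.

```javascript
// Non-global copy of the pattern, so exec() always searches from index 0.
var FIRST_URL = /\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i;

// Return where the first URL starts and how long it is, or null if there is none.
function firstUrl(text) {
    var m = FIRST_URL.exec(text);
    return m ? { index: m.index, length: m[0].length } : null;
}

// Link every URL in one text node, using splitText so no wrapper span is needed.
// Note: this can leave empty text nodes behind, which are harmless.
function linkTextNode(textNode) {
    var found;
    while ((found = firstUrl(textNode.nodeValue)) !== null) {
        var urlNode = textNode.splitText(found.index); // urlNode starts at the URL
        var rest = urlNode.splitText(found.length);    // rest is whatever follows it
        var a = document.createElement('a');
        a.href = urlNode.nodeValue;
        urlNode.parentNode.replaceChild(a, urlNode);   // put the link where the URL was
        a.appendChild(urlNode);                        // move the URL text inside the link
        textNode = rest;                               // keep scanning after the link
    }
}
```

It would be called once per text node, for example from inside the each() callback above in place of the span code: if (this.nodeType == 3) linkTextNode(this);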

idealmachine
Matias
Anyway, I learned some interesting things thanks to you. Maybe the script should exclude img tags as well...
Matias
@Matias: I've edited the script to eliminate all ampersands and double quotes, if you think that's a problem.
idealmachine
It also seems that Blogger replaces $('<span/>') with $("<span></span>").
Matias
Thanks for your time and patience, but I have lost. Blogger doesn't accept the A tag without the quotes. Blogger says: «Open quote is expected for attribute "{1}" associated with an element type "href".»
Matias
Blogger accepts your third edit as valid. Still, the script is not writing anything on the page. It keeps replacing '\u003c' with '\u003c'. I do not want you to go crazy. Thanks for the help... I'll keep trying.
Matias