Background:

I found similar S.O. posts on this topic, but I failed to make them work for my scenario. Apologies in advance if this is a dupe.

My Intent:

Take every English word in a string and convert it to an HTML hyperlink. This logic needs to ignore only the following markup: <br/>, <b>, </b>

Here's what I have so far. It converts English words to hyperlinks as I expect, but has no ignore logic for HTML tags (that's where I need your help):

text = text.replace(/\b([A-Z\-a-z]+)\b/g, "<a href=\"?q=$1\">$1</a>");

Example Input / Output:

Sample Input:

this <b>is</b> a test

Expected Output:

<a href="?q=this">this</a> <b><a href="?q=is">is</a></b> <a href="?q=a">a</a> <a href="?q=test">test</a>

Thank you.

A: 

Issues with regexing HTML aside, the way I'd do this is in two steps:

  • First and foremost, one way or another, extract the text outside the tags
  • Then apply the transform only to that text, and leave everything else untouched
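
The two steps above can be sketched roughly as follows. This is only an illustration, not code from the original answer: the helper name `linkifyOutsideTags` is made up, and it assumes well-formed tags with no stray `<` or `>` in the text. Splitting on a regex with a capturing group keeps the tags in the resulting array, at the odd indices.

```javascript
// Hypothetical helper (not from the original answer).
// Assumes well-formed tags and no stray "<" or ">" in the text.
function linkifyOutsideTags(text) {
    // Step 1: split into tag and non-tag pieces. Because the pattern is
    // wrapped in a capturing group, the tags are kept in the array and
    // always land at odd indices.
    var pieces = text.split(/(<[^>]*>)/);

    // Step 2: transform only the non-tag pieces; pass tags through untouched.
    return pieces.map(function (piece, index) {
        if (index % 2 === 1) {
            return piece; // a tag such as <b>, </b> or <br/>
        }
        return piece.replace(/\b[A-Za-z-]+\b/g, '<a href="?q=$&">$&</a>');
    }).join('');
}

console.log(linkifyOutsideTags('this <b>is</b> a test'));
```

For the sample input from the question, this should print the expected output given there.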

polygenelubricants
A: 

Here's a hybrid solution that gives you the performance gain of innerHTML and the luxury of not having to mess with HTML strings when looking for the matches:

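function findMatchAndReplace(node, regex, replacement) {

    var parent,
        temp = document.createElement('div'),
        next;

    if (node.nodeType === 3) {

        // Text node: build the replacement markup in a scratch <div>,
        // then move the resulting nodes in front of the original text node.
        parent = node.parentNode;

        temp.innerHTML = node.data.replace(regex, replacement);

        while (temp.firstChild)
            parent.insertBefore(temp.firstChild, node);

        parent.removeChild(node);

    } else if (node.nodeType === 1) {

        // Element node: recurse into each child. The next sibling is saved
        // first because the recursive call may replace the current child.
        if (node = node.firstChild) do {
            next = node.nextSibling;
            findMatchAndReplace(node, regex, replacement);
        } while (node = next);

    }

}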

Input:

<div id="foo">
    this <b>is</b> a test
</div>

Process:

findMatchAndReplace(
    document.getElementById('foo'),
    /\b\w+\b/g,
    '<a href="?q=$&">$&</a>'
);

Output (whitespace added for clarity):

<div id="foo">
    <a href="?q=this">this</a>
    <b><a href="?q=is">is</a></b>
    <a href="?q=a">a</a>
    <a href="?q=test">test</a>
</div>
J-P
Thanks J-P, a smart solution. And sorry for not clarifying in my question that although I'm setting a DOM node via innerHTML, the original source of my raw text is an AJAX call, so I don't have a DOM node to start with (as your first param requires).
Matt
@mrscott, you can still get a DOM structure from an Ajax response though. `function toDom(str) { var d = document.createElement('div'); d.innerHTML = str; return d; }`
J-P
Thanks J-P. That does work, but I ran some perf tests on the solution I posted and found that your approach would be about 10x slower on a large body of text containing a sparse set of <br/>, <b> and </b> tags (which mimics the real scenario I'm facing).
Matt
A: 

Here's another JavaScript method.

var StrWith_WELL_FORMED_TAGS    = "This <b>is</b> a test, <br> Mr. O'Leary! <!-- What about comments? -->";
var SplitAtTags                 = StrWith_WELL_FORMED_TAGS.split (/[<>]/);
var ArrayLen                    = SplitAtTags.length;
var OutputStr                   = '';

//-- When the string starts with "<", split() yields an empty first element,
//-- so the odd-numbered elements are always the tag contents.
for (var J=0;  J < ArrayLen;  J++)
{
    var bWeAreInsideTag         = (J % 2) == 1;

    if (bWeAreInsideTag)
    {
        OutputStr              += '<' + SplitAtTags[J] + '>';
    }
    else
    {
        OutputStr              += SplitAtTags[J].replace (/([a-z']+)/gi, '<a href="?q=$1">$1</a>');
    }
}

//-- Replace "console.log" with "alert" if not using Firebug.
console.log (OutputStr);
Brock Adams
Nice solution Brock, but I was hoping for something more concise. I think I have the solution - I'll post it shortly.
Matt
A: 

Thank you all for your answers. I believe I have the solution that works best for my scenario. Sorry for not being more clear in my question - but I'm actually manipulating a raw string (from an AJAX call), then rendering it in a DOM element, via innerHTML. Because of this, I cannot use DOM traversal to parse the original string.

My solution is as follows. Using a regular expression, I first break the string into fragments that are not within HTML tags; a callback then runs on each fragment, applying a word-matching regex like the one in my question to replace its contents with anchor links.

Here it goes:

text = text.replace(/[^<>]\w[^</>]+/g, 
                     function(s) {
                        return s.replace(/\b\w+\b/g, "<a href=\"?q=$&\">$&</a>");
                      });
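
One caveat, as far as I can tell: the outer pattern above needs at least three characters per fragment, so a short word wrapped directly in tags, such as the `is` in `<b>is</b>`, is skipped. A possible variation is sketched below. It is not the code I benchmarked, and the function name `linkifyWords` is made up for illustration. It uses a single pass with an alternation: tags are matched first and returned unchanged, and anything else that matches is a word. It still assumes well-formed tags and no stray `<` or `>` in the text.

```javascript
// Hypothetical variation (not the benchmarked code above).
// Assumes well-formed tags and no stray "<" or ">" in the text.
function linkifyWords(text) {
    // The first alternative consumes a whole tag; the second matches a word.
    // At a "<" the tag alternative wins, so tag names are never linked.
    return text.replace(/<[^>]*>|\b\w+\b/g, function (match) {
        if (match.charAt(0) === '<') {
            return match; // leave the tag untouched
        }
        return '<a href="?q=' + match + '">' + match + '</a>';
    });
}

console.log(linkifyWords('this <b>is</b> a test'));
```

On the sample input from my question, this should produce the expected output listed there, including the link around `is`.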
Matt
This will break on all kinds of valid HTML. `<br />` or `<div class='SomeClass'>`, for example. It also hyperlinks words in comments and hyperlinks contractions as two words separated by an apostrophe!
Brock Adams
Thanks for the comment Brock, but as I wrote in my question, I don't need a general solution for all valid HTML. As per my requirements, I only need this to work for a subset of HTML, specifically to ignore only the following markup: <br/>, <b>, </b>
Matt