I need to use JavaScript to do three things:

  1. Select all nodes with a class of "foo".
  2. Find all words inside these nodes that begin with "*".
  3. Surround those words with <span class="xyz"> ... </span>, where xyz is the word itself.

For example, the content:

<ul>
  <li class="foo">
    *abc def *ghi
  </li>
  <li class="bar">
    abc *def *ghi
  </li>
</ul>

would become

<ul>
  <li class="foo">
    <span class="abc">*abc</span> def <span class="ghi">*ghi</span>
  </li>
  <li class="bar">
    abc *def *ghi    <!-- Not part of a node with class "foo", so no changes made. -->
  </li>
</ul>

How might I do this? (P.S. Solutions involving jQuery work too, but other than that I'd prefer not to include any additional dependencies.)

A: 

The regexp would look something like this (sed-ish syntax):

s/\*\(\w+\)\b\(?![^<]*>\)/<span class="\1">*\1</span>/g

Thus:

$('li.foo').each(function() {
    var html = $(this).html();
    html = html.replace(/\*(\w+)\b(?![^<]*>)/g, "<span class=\"$1\">*$1</span>");
    $(this).html(html);
});

The \*(\w+)\b segment is the important piece. It finds an asterisk followed by one or more word characters followed by some sort of word termination (e.g. end of line, or a space). The word is captured into $1, which is then used as the text and the class of the output.

The part just after that, (?![^<]*>), is a negative lookahead. It asserts that the match is not followed by a closing angle bracket unless an opening angle bracket comes first. This prevents a match where the string is inside an HTML tag, such as in an attribute value. It doesn't handle malformed HTML, but that shouldn't come up here anyway.
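
To see the lookahead in action without a page, here is a minimal sketch that applies the same regex to plain strings. The helper name wrapStarWords is made up for this illustration and is not part of the answer's code; the second input is the <em title="*xyz"> case raised in the comments.

```javascript
// Hypothetical helper (not from the answer): applies the answer's regex
// to an HTML string and returns the result.
function wrapStarWords(html) {
    return html.replace(/\*(\w+)\b(?![^<]*>)/g, '<span class="$1">*$1</span>');
}

// Plain text: no '>' follows either word, so both starred words are wrapped.
var plain = wrapStarWords('*abc def *ghi');

// Inside a tag: after *xyz a '>' is reached before any '<', so the
// lookahead rejects it. The *abc in the element's text is still wrapped.
var attr = wrapStarWords('<em title="*xyz">*abc</em>');

console.log(plain);
console.log(attr);
```

This only checks the string replacement; it says nothing about how a particular browser serializes html().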

strager
In the replacement string, JavaScript uses $1 instead of \\1. Also, I recommend text() instead of html() since you might have a class with the same name as a tag or attribute somewhere inside the <li>. Very nice!
system PAUSE
@system PAUSE, I'm not convinced about using text() instead of html(). If you update text(), it'll be escaped, no?
strager
@strager, You're right, text(t) will escape the new span, so it should be html(t). I'm trying to solve this case: <div class="foo"><em title="*xyz">*abc</em></div>. Using html() includes the whole <em> in the extracted text -- *xyz is replaced, resulting in invalid HTML. text() will yield only *abc.
system PAUSE
@system PAUSE, Ah, I see what you mean. I'll work on the regexp later to meet your requirements.
strager
@system PAUSE, Updated my answer. Works according to RegexBuddy (which uses .NET regexp AFAIK, but hopefully it works with JS as well).
strager
@strager: Thanks for your answer. Can you explain the regex portion?
@strager, negative look-behind isn't supported, at least not per standards, in JavaScript RegExp.
system PAUSE
@strager, since 'words' are always in an element, and I don't think that unencoded '<' or '>' can be in an attribute value, do you think this might work? -- /(>[^<]*?)\*(\w+)\b/g
system PAUSE
@system PAUSE, That works except in the case where there is no element at the beginning of the string. I didn't know negative look-behind didn't work for JS. I'll look for an alternative.
strager
An unencoded ‘>’ is quite valid in an attribute value.
bobince
@bobince, thanks for the correction... I started looking at the DTD but my eyes were glazing over.
system PAUSE
(As for lookaround: none of them are officially supported. Most modern implementations do support lookahead, but you still can't use them because the RegExp implementation used in IE/JScript/VBScript is badly bugged. argh)
bobince
+2  A: 

No jQuery required:

var UE_replacer = function (node) {

   // just for performance, skip attribute and
   // comment nodes (types 2 and 8, respectively)
   if (node.nodeType == 2) return;
   if (node.nodeType == 8) return;

   // for text nodes (type 3), wrap words of the
   // form *xyzzy with a span that has class xyzzy
   if (node.nodeType == 3) {

      // in the actual text, the nodeValue, change
      // all strings ('g'=global) that start and end
      // on a word boundary ('\b') where the first
      // character is '*' and is followed by one or
      // more ('+'=one or more) 'word' characters
      // ('\w'=word character). save all the word
      // characters (that's what parens do) so that
      // they can be used in the replacement string
      // ('$1'=re-use saved characters).
      var text = node.nodeValue.replace(
            /\b\*(\w+)\b/g,
            '<span class="$1">*$1</span>'   // <== Wrong! nodeValue is plain text, so this markup is never parsed
      );

      // set the new text back into the nodeValue
      node.nodeValue = text;
      return;
   }

   // for all other node types, call this function
   // recursively on all its child nodes
   for (var i=0; i<node.childNodes.length; ++i) {
      UE_replacer( node.childNodes[i] );
   }
}

// start the replacement on 'document', which is
// the root node
UE_replacer( document );

Updated: To contrast with the direction of strager's answer, I got rid of my botched jQuery and kept the regular expression as simple as possible. This 'raw' JavaScript approach turns out to be much easier than I expected.

Although jQuery is clearly good for manipulating DOM structure, it's actually not easy to figure out how to manipulate text nodes with it.

system PAUSE
Not exactly sure, but this may trip up on things like: <li class="foo"><a href="#"><span class="*asdf">*asdf</span></a></li>
strager
@system: Can you explain how your answer differs from strager's? I'm having trouble seeing the difference between the two.
@Unknown Entity, system PAUSE's approach is to iterate through DOM nodes and perform replacements on the text nodes (adding the spans as necessary). My approach just operates on the HTML contained within a single node. In theory, both methods should work equally well.
strager
@Unknown Entity, strager's got it right, except that my version still has a bug ... see the first comment on this answer.
system PAUSE
If you put “<span>” into a nodeValue, you just get the text “<span>” and not an element. Also \b\*\w won't work: * is not a word character, so there won't be a word boundary before it.
bobince
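
The word-boundary point can be checked on a plain string. This is just a sketch: the variable names and the bracket replacement are made up for the demonstration, and the second pattern is one possible alternative (start of string or a non-word character before the asterisk).

```javascript
// Pattern from the answer: demands a word boundary right before '*'.
var withBoundary = /\b\*(\w+)\b/g;

// Alternative: demand start of string or a non-word character before '*'.
var withPrefix = /(^|\W)\*(\w+)\b/g;

var sample = '*abc def *ghi';

// '*' is not a word character, so there is no boundary between the start
// of the string (or a space) and '*'. Nothing matches; the string is unchanged.
var unchanged = sample.replace(withBoundary, '[$1]');

// $1 puts back the character consumed before '*', $2 is the bare word.
var bracketed = sample.replace(withPrefix, '$1[$2]');

console.log(unchanged);
console.log(bracketed);
```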
@bobince, noted, and thanks. I might get back to this later ... meanwhile, I would vote myself down if I could, to keep this out of the way.
system PAUSE
Bravo on your new method. =]
strager
+1  A: 

Don't try to process the innerHTML/html() of an element. This will never work because regex is not powerful enough to parse HTML. Just walk over the Text nodes looking for what you want:

// Replace words in text content, recursively walking element children.
//
function wrapWordsInDescendants(element, tagName, className) {
    for (var i= element.childNodes.length; i-->0;) {
        var child= element.childNodes[i];
        if (child.nodeType==1) // Node.ELEMENT_NODE
            wrapWordsInDescendants(child, tagName, className);
        else if (child.nodeType==3) // Node.TEXT_NODE
            wrapWordsInText(child, tagName, className);
    }
}

// Replace words in a single text node
//
function wrapWordsInText(node, tagName, className) {

    // Get list of *word indexes
    //
    var ixs= [];
    var match;
    while (match= starword.exec(node.data))
        ixs.push([match.index, match.index+match[0].length]);

    // Wrap each in the given element
    //
    for (var i= ixs.length; i-->0;) {
        var element= document.createElement(tagName);
        element.className= className;
        node.splitText(ixs[i][1]);
        element.appendChild(node.splitText(ixs[i][0]));
        node.parentNode.insertBefore(element, node.nextSibling);
    }
}
var starword= /(^|\W)\*\w+\b/g;

// Process all elements with class 'foo'
//
$('.foo').each(function() {
    wrapWordsInDescendants(this, 'span', 'xyz');
});


// If you're not using jQuery, you'll need the below bits instead of $...

// Fix missing indexOf method on IE
//
if (![].indexOf) Array.prototype.indexOf= function(item) {
    for (var i= 0; i<this.length; i++)
        if (this[i]==item)
            return i;
    return -1;
}

// Iterating over '*' (all elements) is not fast; if possible, reduce to
// all elements called 'li', or all elements inside a certain element etc.
//
var elements= document.getElementsByTagName('*');
for (var i= elements.length; i-->0;)
    if (elements[i].className.split(' ').indexOf('foo')!=-1)
        wrapWordsInDescendants(elements[i], 'span', 'xyz');
bobince
@bobince, nice structure! and I like your regex (I wasn't sure about \b matching in front of *). However, you are setting all classnames to "xyz". The classname for starword *abc must be "abc".
system PAUSE
Oh, OK, in that case: “element.className= element.firstChild.data” (after the insertBefore).
bobince
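
Putting the class-name point together with the text-node approach, a rough sketch might look like the following. Both function names are hypothetical and not from any answer above. It assumes the span should wrap only the *word itself, not the character matched by (^|\W), and that the class should be the bare word without the asterisk. The second function needs a browser DOM, so it is only defined here and never called.

```javascript
// Hypothetical helper: returns the offsets and bare word for each *word.
function findStarWords(text) {
    var re = /(^|\W)\*(\w+)\b/g;
    var found = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        found.push({
            // Skip the prefix captured by (^|\W) so it stays outside the span.
            start: match.index + match[1].length,
            end: match.index + match[0].length,
            word: match[2]
        });
    }
    return found;
}

// Hypothetical helper: wraps each *word of a Text node in <span class="word">.
// Assumes a browser DOM (document, Text.splitText).
function wrapStarWordsInTextNode(node) {
    var found = findStarWords(node.data);
    // Work backwards so earlier offsets stay valid after each split.
    for (var i = found.length; i-- > 0;) {
        var span = document.createElement('span');
        span.className = found[i].word;
        node.splitText(found[i].end);
        span.appendChild(node.splitText(found[i].start));
        node.parentNode.insertBefore(span, node.nextSibling);
    }
}

var hits = findStarWords('*abc def *ghi');
console.log(hits);
```

Calling wrapStarWordsInTextNode from the recursive walker in place of wrapWordsInText would give each span its own class, under the assumptions stated above.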