ansaurus

Question

How do I use JavaScript to modify the content of a node?

Answer 1

A:

The regexp would look something like this (sed-ish syntax):

s/\*\(\w+\)\b\(?![^<]*>\)/<span class="\1">*\1</span>/g

Thus:

$('li.foo').each(function() {
    var html = $(this).html();
    html = html.replace(/\*(\w+)\b(?![^<]*>)/g, "<span class=\"$1\">*$1</span>");
    $(this).html(html);
});

The \*(\w+)\b segment is the important piece. It finds an asterisk followed by one or more word characters followed by some sort of word termination (e.g. end of line, or a space). The word is captured into $1, which is then used as the text and the class of the output.

The part just after that, (?![^<]*>), is a negative lookahead. It asserts that the text after the word does not run into a closing angle bracket before reaching an opening one. If it did, the word would be sitting inside an HTML tag, so such matches are rejected. This doesn't handle malformed HTML, but the input shouldn't be malformed anyway.
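
As a rough sketch of how the pattern behaves, the snippet below applies the same regular expression to plain strings, with no jQuery or DOM involved. The helper name wrapStarWords and the sample strings are made up for illustration. The last case is an assumed input where the lookahead is fooled, because an opening angle bracket follows the word inside an attribute value.

```javascript
// Hypothetical helper: apply the answer's regexp to an HTML string.
function wrapStarWords(html) {
    return html.replace(/\*(\w+)\b(?![^<]*>)/g, '<span class="$1">*$1</span>');
}

// Plain text: the word is wrapped and reused as the class name.
var plain = wrapStarWords('see *abc here');

// Inside a tag: *xyz in the attribute is skipped, *abc in the text is wrapped.
var tagged = wrapStarWords('<em title="*xyz">*abc</em>');

// Limitation: a '<' after the word hides the closing '>' from the lookahead,
// so the attribute value is (wrongly) rewritten.
var fooled = wrapStarWords('<a title="*xyz < 3">t</a>');

console.log(plain);
console.log(tagged);
console.log(fooled);
```

The third case is the kind of input that motivates walking text nodes instead of rewriting the HTML string.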

strager 2009-04-02 23:03:52

In the replacement string, JavaScript uses $1 instead of \\1. Also, I recommend text() instead of html() since you might have a class with the same name as a tag or attribute somewhere inside the <li>. Very nice!

system PAUSE 2009-04-02 23:29:26

@system PAUSE, I'm not convinced about using text() instead of html(). If you update text(), it'll be escaped, no?

strager 2009-04-03 00:40:11

@strager, You're right, text(t) will escape the new span, so it should be html(t). I'm trying to solve this case: <div class="foo"><em title="*xyz">*abc</em></div>. Using html() includes the whole <em> in the extracted text -- *xyz is replaced, resulting in invalid HTML. text() will yield only *abc.

system PAUSE 2009-04-03 01:23:11

@system PAUSE, Ah, I see what you mean. I'll work on the regexp later to meet your requirements.

strager 2009-04-03 01:33:56

@system PAUSE, Updated my answer. Works according to RegexBuddy (which uses .NET regexp AFAIK, but hopefully it works with JS as well).

strager 2009-04-03 01:44:43

@strager: Thanks for your answer. Can you explain the regex portion?

2009-04-03 01:48:07

@strager, negative look-behind isn't supported, at least not per standards, in JavaScript RegExp.

system PAUSE 2009-04-03 02:29:06

@strager, since 'words' are always in an element, and I don't think that unencoded '<' or '>' can be in an attribute value, do you think this might work? -- /(>[^<]*?)\*(\w+)\b/g

system PAUSE 2009-04-03 02:45:18

@system PAUSE, That works except in the case where there is no element at the beginning of the string. I didn't know negative look-behind didn't work for JS. I'll look for an alternative.

strager 2009-04-03 02:52:08

An unencoded ‘>’ is quite valid in an attribute value.

bobince 2009-04-03 03:27:58

@bobince, thanks for the correction... I started looking at the DTD but my eyes were glazing over.

system PAUSE 2009-04-03 03:30:10

(As for lookaround: none of them are officially supported. Most modern implementations do support lookahead, but you still can't use them because the RegExp implementation used in IE/JScript/VBScript is badly bugged. argh)

bobince 2009-04-03 03:35:11

Answer 2

+2 A:

No jQuery required:

var UE_replacer = function (node) {

   // just for performance, skip attribute and
   // comment nodes (types 2 and 8, respectively)
   if (node.nodeType == 2) return;
   if (node.nodeType == 8) return;

   // for text nodes (type 3), wrap words of the
   // form *xyzzy with a span that has class xyzzy
   if (node.nodeType == 3) {

      // in the actual text, the nodeValue, change
      // all strings ('g'=global) that start and end
      // on a word boundary ('\b') where the first
      // character is '*' and is followed by one or
      // more ('+'=one or more) 'word' characters
      // ('\w'=word character). save all the word
      // characters (that's what parens do) so that
      // they can be used in the replacement string
      // ('$1'=re-use saved characters).
      var text = node.nodeValue.replace(
            /\b\*(\w+)\b/g,
            '<span class="$1">*$1</span>'   // <== Wrong!
      );

      // set the new text back into the nodeValue
      node.nodeValue = text;
      return;
   }

   // for all other node types, call this function
   // recursively on all its child nodes
   for (var i=0; i<node.childNodes.length; ++i) {
      UE_replacer( node.childNodes[i] );
   }
}

// start the replacement on 'document', which is
// the root node
UE_replacer( document );

Updated: To contrast the direction of strager's answer, I got rid of my botched jQuery and kept the regular expression as simple as possible. This 'raw' JavaScript approach turns out to be much easier than I expected.

Although jQuery is clearly good for manipulating DOM structure, it's actually not easy to figure out how to use it to manipulate text nodes.

system PAUSE 2009-04-02 23:23:49

Not exactly sure, but this may trip up on things like: <li class="foo"><a href="#"><span class="*asdf">*asdf</span></a></li>

strager 2009-04-03 01:46:35

@system: Can you explain how your answer differs from strager's? I'm having trouble seeing the difference between the two.

2009-04-03 01:48:37

@Unknown Entity, system PAUSE's approach is to iterate through the DOM nodes and perform replacements on the text nodes (adding spans as necessary). My approach just operates on the HTML contained within a single node. In theory, both methods should work equally well.

strager 2009-04-03 02:29:01

@Unknown Entity, strager's got it right, except that my version still has a bug ... see the first comment on this answer.

system PAUSE 2009-04-03 02:47:29

If you put “<span>” into a nodeValue, you just get the text “<span>” and not an element. Also \b\*\w won't work: * is not a word character, so there won't be a word boundary before it.

bobince 2009-04-03 03:38:11
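
The word-boundary point can be checked on plain strings. The sketch below compares the pattern from the answer with one that requires the start of the string or a non-word character before the asterisk. The variable names and the bracket replacement are only for illustration.

```javascript
// Pattern from the answer: needs a word boundary *before* the asterisk.
var withBoundary = /\b\*(\w+)\b/g;
// Alternative: start of string or a non-word character before the asterisk.
var withPrefix = /(^|\W)\*(\w+)\b/g;

// ' ' and '*' are both non-word characters, so there is no \b between them.
var missed = 'a *abc b'.replace(withBoundary, '[$1]');
// 'a' followed by '*' does form a boundary, so only this odd case matches.
var odd = 'a*abc b'.replace(withBoundary, '[$1]');
// The prefix form matches the intended case and keeps the prefix via $1.
var found = 'a *abc b'.replace(withPrefix, '$1[$2]');

console.log(missed, '|', odd, '|', found);
```

Because the prefix form consumes the character before the asterisk, any code that uses the match position has to skip over that captured prefix.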

@bobince, noted, and thanks. I might get back to this later ... meanwhile, I would vote myself down if I could, to keep this out of the way.

system PAUSE 2009-04-03 03:56:57

Bravo on your new method. =]

strager 2009-04-04 17:31:50

Answer 3

+1 A:

Don't try to process the innerHTML/html() of an element. This will never work because regex is not powerful enough to parse HTML. Just walk over the Text nodes looking for what you want:

// Replace words in text content, recursively walking element children.
//
function wrapWordsInDescendants(element, tagName, className) {
    for (var i= element.childNodes.length; i-->0;) {
        var child= element.childNodes[i];
        if (child.nodeType==1) // Node.ELEMENT_NODE
            wrapWordsInDescendants(child, tagName, className);
        else if (child.nodeType==3) // Node.TEXT_NODE
            wrapWordsInText(child, tagName, className);
    }
}

// Replace words in a single text node
//
function wrapWordsInText(node, tagName, className) {

    // Get list of *word indexes
    //
    var ixs= [];
    var match;
    while (match= starword.exec(node.data))
        // skip the non-word character captured before the asterisk
        ixs.push([match.index+match[1].length, match.index+match[0].length]);

    // Wrap each in the given element
    //
    for (var i= ixs.length; i-->0;) {
        var element= document.createElement(tagName);
        element.className= className;
        node.splitText(ixs[i][1]);
        element.appendChild(node.splitText(ixs[i][0]));
        node.parentNode.insertBefore(element, node.nextSibling);
    }
}
var starword= /(^|\W)\*\w+\b/g;

// Process all elements with class 'foo'
//
$('.foo').each(function() {
    wrapWordsInDescendants(this, 'span', 'xyz');
});


// If you're not using jQuery, you'll need the below bits instead of $...

// Fix missing indexOf method on IE
//
if (![].indexOf) Array.prototype.indexOf= function(item) {
    for (var i= 0; i<this.length; i++)
        if (this[i]==item)
            return i;
    return -1;
}

// Iterating over '*' (all elements) is not fast; if possible, narrow it to
// all 'li' elements, or all elements inside a certain element, etc.
//
var elements= document.getElementsByTagName('*');
for (var i= elements.length; i-->0;)
    if (elements[i].className.split(' ').indexOf('foo')!=-1)
        wrapWordsInDescendants(elements[i], 'span', 'xyz');

bobince 2009-04-03 03:32:51

@bobince, nice structure! and I like your regex (I wasn't sure about \b matching in front of *). However, you are setting all classnames to "xyz". The classname for starword *abc must be "abc".

system PAUSE 2009-04-03 03:47:02

Oh, OK, in that case: “element.className= element.firstChild.data” (after the insertBefore).

bobince 2009-04-03 06:13:09
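
One possible way to give each span its own class is sketched below, assuming the class should be the bare word without the asterisk, as requested in the comment above. The names starWordRanges and wrapStarWordsInText are hypothetical and not part of the answer. The first function works on plain strings; the second needs a real DOM text node and has not been tested here.

```javascript
// Hypothetical helper: locate each *word in a string.
// Returns the start/end offsets of "*word" and the bare word itself.
function starWordRanges(text) {
    var re = /(^|\W)\*(\w+)\b/g;   // like starword, but with the word captured
    var ranges = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        // match[1] is the prefix before the asterisk ('' at start of string)
        var start = match.index + match[1].length;
        var end = match.index + match[0].length;
        ranges.push({ start: start, end: end, word: match[2] });
    }
    return ranges;
}

// Hypothetical DOM counterpart: wrap each *word of a text node in an
// element whose class name is the word. Walks backwards so that earlier
// offsets stay valid while the node is being split.
function wrapStarWordsInText(node, tagName) {
    var ranges = starWordRanges(node.data);
    for (var i = ranges.length; i-- > 0;) {
        var element = node.ownerDocument.createElement(tagName);
        element.className = ranges[i].word;
        node.splitText(ranges[i].end);
        element.appendChild(node.splitText(ranges[i].start));
        node.parentNode.insertBefore(element, node.nextSibling);
    }
}

var sample = 'a *abc, *de';
var ranges = starWordRanges(sample);
console.log(ranges);
```

Keeping the offset search separate from the DOM splitting makes the regular expression easy to check on its own.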
