ansaurus

Question

Regexp to search/replace only text, not in HTML attribute

Answer 1

A:

Hi,

HTML is not a "regular language", so regex is not the right tool for parsing it. You might be better off using an HTML parser like this one to get at the attribute and then applying a regex to its value.

Enjoy!

Doug 2010-08-11 15:29:11

That's a Java HTML parser. He wants to do this in JavaScript.

BalusC 2010-08-11 16:37:40

Answer 2

A:

Don't parse ~~regex~~HTML with ~~HTML~~regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first and then use an XML parser.

Vivin Paliath 2010-08-11 15:29:13

You probably mean “don’t parse HTML with regex”, not the other way around. ;)

Scytale 2010-08-11 15:31:29

@Scytale - He's just being thorough; so long as we're on the subject, though, people shouldn't parse RegEx with HTML either! ;)

LeguRi 2010-08-11 15:34:30

@Scytale @Richard hahaha I didn't even see that. My bad - will fix :)

Vivin Paliath 2010-08-11 16:07:32

Answer 3

+1 A:

Do not try to rewrite your expression to do this. You won’t succeed and will almost certainly forget about some corner cases. In the best case, this will lead to nasty bugs and in the worst case you will introduce security problems.

Instead, since you’re already using JavaScript and have well-formed markup, use a genuine XML parser to loop over the text nodes and apply your regex only to them.
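
A minimal sketch of that idea, assuming the markup is already available as a DOM tree (for example in a browser). The helper names mapTextNodes and spaceAfterPunctuation are hypothetical, and skipping SCRIPT and STYLE elements is an added assumption, not something the answer asks for. The walker only relies on nodeType, nodeName, childNodes and nodeValue, so attribute values are never touched.

```javascript
// Hypothetical helper: applies `transform` to every text node under `root`.
// Works on anything exposing nodeType, nodeName, childNodes and nodeValue.
function mapTextNodes(root, transform) {
    var ELEMENT_NODE = 1; // numeric value of Node.ELEMENT_NODE
    var TEXT_NODE = 3;    // numeric value of Node.TEXT_NODE
    var stack = [root];   // explicit stack instead of recursion
    while (stack.length > 0) {
        var node = stack.pop();
        if (node.nodeType === TEXT_NODE) {
            node.nodeValue = transform(node.nodeValue);
        } else if (node.nodeType === ELEMENT_NODE) {
            // Assumption: script and style contents should be left alone.
            var name = String(node.nodeName || '').toUpperCase();
            if (name === 'SCRIPT' || name === 'STYLE') continue;
            // Attributes are not child nodes, so they are never visited.
            for (var i = 0; i < node.childNodes.length; i++) {
                stack.push(node.childNodes[i]);
            }
        }
    }
}

// Example transform: the punctuation regex used elsewhere in this thread.
function spaceAfterPunctuation(text) {
    return text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
}

// In a browser one would call:
//   mapTextNodes(document.body, spaceAfterPunctuation);
```

Passing the transform as a parameter keeps the tree walk separate from the regex, so the same walker can be reused for other text-only replacements.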

Scytale 2010-08-11 15:30:19

Answer 4

A:

As stated above and many times before, HTML is not a regular language and thus cannot be parsed with regular expressions.

You will have to do this recursively; I'd suggest crawling the DOM object.

Try something like this...

function regexReplaceInnerText(curr_element) {
    if (curr_element.childNodes.length <= 0) { // termination case:
                                               // no children; this is a "leaf node"
        if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) { // node is text; not an empty tag like <br />
            if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") { // node isn't just white space
                                                                     // (you can skip this check if you want)
                var text = curr_element.data;
                text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                curr_element.data = text;
            }
        }
    } else {
        // recursive case:
        // this isn't a leaf node, so we iterate over all children and recurse
        for (var i = 0; curr_element.childNodes[i]; i++) {
            regexReplaceInnerText(curr_element.childNodes[i]);
        }
    }
}
// then get the element whose children's text nodes you want to be regex'd
regexReplaceInnerText(document.getElementsByTagName("body")[0]);
// or if you don't want to do the whole document...
regexReplaceInnerText(document.getElementById("ElementToRegEx"));

LeguRi 2010-08-11 15:33:15

Answer 5

+1 A:

If you can access that text through the DOM, you can do this:

function fixPunctuation(elem) {
    // check if parameter is an ELEMENT_NODE
    if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
    var children = elem.childNodes, node;
    // iterate the child nodes of the element node
    for (var i=0; children[i]; ++i) {
        node = children[i];
        // check the child’s node type
        switch (node.nodeType) {
        case Node.ELEMENT_NODE:
            // call fixPunctuation if it’s also an ELEMENT_NODE
            fixPunctuation(node);
            break;
        case Node.TEXT_NODE:
            // fix punctuation if it’s a TEXT_NODE
            node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
            break;
        }
    }
}

Now just pass the DOM node to that function like this:

fixPunctuation(document.body);
fixPunctuation(document.getElementById("foobar"));

Gumbo 2010-08-11 15:44:21

You mis-spelt the function name `fixPunctuation` as `fixPunctutation` a few times ;)

LeguRi 2010-08-11 16:05:58

@Richard JP Le Guen: Ah, you’re right, thanks. Fixed that.

Gumbo 2010-08-11 16:31:19

Answer 6

A:

You can use a lookahead to make sure the match isn't occurring inside a tag:

text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');

The usual warnings apply regarding CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values. But I suspect your real problems will arise from the vagaries of "plain" text; HTML's not even in the same league. :D
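
To make the behaviour of that lookahead concrete, here is a small sketch that runs the same regex over a string containing attribute values. The wrapper name fixPunctuationOutsideTags and the sample markup are made up for illustration, and the caveats listed above still apply.

```javascript
// Hypothetical wrapper around the regex from this answer.
function fixPunctuationOutsideTags(html) {
    // (?![^<>]*>) rejects any position from which a '>' can be reached
    // without crossing another angle bracket, i.e. a position inside a tag.
    return html.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');
}

var sample = '<a href="a,b.html" title="x,y">Test,and more.Done</a>';
var fixed = fixPunctuationOutsideTags(sample);
// fixed is '<a href="a,b.html" title="x,y">Test, and more. Done</a>'
```

The commas and period inside href and title are left alone because a closing > follows them before any other angle bracket, while the text between the tags gets its spacing fixed. Digits after the punctuation, as in 3.14 or 1,000, are also left alone by the [^ \d] part.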

Alan Moore 2010-08-11 22:40:23

It doesn't work. "Test,and" should become "Test, and". I was thinking of lookafter too, but I couldn't get it to work. Something like looking for "...> anything but < (text to find/replace)". And I think the [^<>]* part above is not necessary.

jcisio 2010-08-13 08:21:25

There were more asterisks in there when I tested it, but they disappeared. Try it now.

Alan Moore 2010-08-14 04:00:30
