I'm still learning regex (obviously) and I can't figure this out, and I want to do it the right way rather than the long way. How can I:

Find every <p> or </p> and replace it with \n, except the first <p> and the last </p>, which should just be removed (replaced with nothing), and also replace <br>, <br /> and <br/> with \n.

With regex OR something else. I'm getting this string back from a jQuery $.get() call. So please don't flame me about it; I just don't know how to do it.

A: 

JavaScript has rather nice tools for dealing with an XML (or XHTML) DOM. Use those.

Eric Mickelsen
Or, y'know, jQuery or Prototype or something.
Mark
OK, so I'm getting a massive string of text with <p>s and <br>s. How do I convert it with jQuery to remove all of those?
Oscar Godson
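For what it's worth, the DOM route suggested above might look roughly like the sketch below. The names nodeToText and containerToText are made up for illustration, and the code assumes the fragment is essentially a run of <p> elements, so one leading and one trailing newline are simply trimmed. It only reads standard DOM node properties (nodeType, nodeName, nodeValue, childNodes).

```javascript
// Recursively turn a DOM node into plain text.
// <br> becomes "\n"; each <p> child is wrapped in "\n" on both sides.
function nodeToText(node) {
    if (node.nodeType === 3) { return node.nodeValue; }  // text node
    if (node.nodeType !== 1) { return ""; }               // skip comments etc.
    if (node.nodeName.toLowerCase() === "br") { return "\n"; }

    var parts = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        var child = node.childNodes[i];
        var text = nodeToText(child);
        if (child.nodeType === 1 && child.nodeName.toLowerCase() === "p") {
            text = "\n" + text + "\n";
        }
        parts.push(text);
    }
    return parts.join("");
}

// Hypothetical wrapper: drop the newline produced by the first <p>
// and the one produced by the last </p>.
function containerToText(container) {
    return nodeToText(container).replace(/^\n/, "").replace(/\n$/, "");
}
```

In a browser you would create the container with document.createElement("div"), assign the string from $.get() to its innerHTML, and pass that div to containerToText.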
A: 

From a regex perspective, to make the first <p> an exception, you must identify a pattern that makes the first <p> fail to match. For example, if the text before the first <p> is abcxyz, that is, abcxyz<p>, then you search for every <p> which is not preceded by abcxyz, so that the first <p> doesn't match. As a regex, that becomes: (?<!abcxyz)<p>

To make the last </p> an exception, you must identify a pattern that makes the last </p> fail to match. For example, if the text after the last </p> is abcxyz, that is, </p>abcxyz, then you search for every </p> which is not followed by abcxyz, so that the last </p> doesn't match. As a regex, that becomes: </p>(?!abcxyz)

Although JavaScript supports positive and negative look-ahead, its regex engine unfortunately did not support positive or negative look-behind for a long time (look-behind was only added in ES2018, so older browsers reject it). There are some dirty tricks to mimic look-behind in JavaScript, but not every look-behind construct can be mimicked.

Thus, if possible, try to identify a pattern which makes the first <p> fail to match, but express it with negative look-ahead.
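To make the look-ahead idea concrete, here is a small sketch. It assumes "the last </p>" can be recognised as the one with no other </p> anywhere after it, which avoids look-behind entirely; for the first <p>, a replace without the g flag already touches only the first match. The sample string and variable names are illustrative only.

```javascript
var html = "<p>one</p><p>two</p><p>three</p>";

// The last </p>: a </p> with no other </p> anywhere after it.
var lastClose = /<\/p>(?![\s\S]*<\/p>)/;

// Every </p> except the last: a </p> that still has another </p> after it.
var allButLastClose = /<\/p>(?=[\s\S]*<\/p>)/g;

// Without the g flag, replace() only changes the first match.
var firstOpenRemoved = html.replace(/<p>/, "");
// "one</p><p>two</p><p>three</p>"

var lastCloseRemoved = html.replace(lastClose, "");
// "<p>one</p><p>two</p><p>three"

var innerClosesReplaced = html.replace(allButLastClose, "\n");
// "<p>one\n<p>two\n<p>three</p>"
```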

To replace the first <p> and the last </p> with nothing, you can invert the logic used above, and you have to do this in a separate step.

To replace <br>, <br /> and <br/> with \n, search for <br\s*\/?> (with the g flag, so every occurrence is matched) and replace with \n.
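Putting the steps of this answer together, one possible sketch is shown below. The function name htmlToPlainText is hypothetical, and the code assumes the tags are plain <p>, </p> and <br> variants without attributes. The first <p> and the last </p> are removed in separate steps before the remaining tags are turned into newlines.

```javascript
// Sketch only: assumes attribute-free <p>, </p> and <br> tags.
function htmlToPlainText(html) {
    return html
        // Step 1: remove the first <p> (no g flag, so only the first match).
        .replace(/<p>/i, "")
        // Step 2: remove the last </p>, i.e. the one with no </p> after it.
        .replace(/<\/p>(?![\s\S]*<\/p>)/i, "")
        // Step 3: every remaining <p> or </p> becomes a newline.
        .replace(/<\/?p>/gi, "\n")
        // Step 4: <br>, <br /> and <br/> become a newline.
        .replace(/<br\s*\/?>/gi, "\n");
}

var sample = "<p>a</p><p>b<br>c<br />d<br/>e</p>";
var plain = htmlToPlainText(sample);
// plain is "a\n\nb\nc\nd\ne"
```

Note that each </p><p> pair yields two newlines, which follows the rule in the question literally; collapsing them would be an extra replace.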

Vantomex
A: 

One way to do this would be to allow the browser to do it for you. In IE and WebKit, you could assign your HTML as the innerHTML of a <div> and get its innerText. However, that won't work in Firefox or Opera. Here's a slightly bizarre use of the Selection object that will do it:

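function getInnerText(html) {
    var text = "";
    // Let the browser parse the HTML into a temporary <div>.
    var div = document.createElement("div");
    div.innerHTML = html;

    // Temporarily attach the div to the document so its contents can be selected.
    document.body.appendChild(div);
    if (typeof window.getSelection != "undefined") {
        // Selection API: select the div's contents and read them back as text.
        var sel = window.getSelection();
        sel.removeAllRanges();
        var range = document.createRange();
        range.selectNodeContents(div);
        sel.addRange(range);
        text = sel.toString();
        sel.removeAllRanges();
    } else if (typeof document.body.createTextRange != "undefined") {
        // Older IE: fall back to a TextRange (note the typeof check).
        var range = document.body.createTextRange();
        range.moveToElementText(div);
        text = range.text;
    }
    document.body.removeChild(div);
    // Normalize all line endings to \n.
    return text.replace(/\r\n/g, "\n").replace(/\r/g, "\n");
}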
Tim Down