Hi,

I'm writing an HTML parser in Flex (AS3) and I need to remove some HTML tags that are not needed.

For example, I want to remove the divs from this code:

           <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div>
                        <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>

and end with something like this:

                      <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>

My question is, how can I write a regular expression to remove these unwanted DIVs? Is there a better way to do it?

Thanks in advance.

+2  A: 

You can't match arbitrarily nested constructs with a regular expression, because arbitrarily deep nesting is not a regular language. A parser (which you are writing) is the correct tool for this.

Now in this very special case, you could do a

result = subject.replace(/^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg, "");

(which would simply remove every run of directly consecutive <div> or </div> tags except the last tag in each run), but this is bad in so many ways that I'm afraid it will get me downvoted into oblivion.

To explain:

^           # match start of line
\s*         # match leading whitespace
(</?div>)   # match a <div> or </div>, remember which
(?:\s*\1)*  # match any further <div> or </div>, same one as before
(?=\s*\1)   # as long as there is another one right ahead

Can you count the ways in which this will fail? (Think comments, unmatched <div>s etc.)
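
To make the behaviour concrete, here is a minimal sketch in TypeScript (not AS3; the regex literal and the `replace` call should carry over, but that is an assumption I have not tested in Flex). The helper names `collapseDivRuns` and `countTag` and both sample strings are made up for illustration. The first input is cleanly nested and collapses as intended; the second has sibling divs inside a wrapper and comes out unbalanced.

```typescript
// The pattern from the answer, unchanged.
const CONSECUTIVE_DIVS = /^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg;

// Hypothetical helper: removes each run of identical div tags except the last tag.
function collapseDivRuns(subject: string): string {
  return subject.replace(CONSECUTIVE_DIVS, "");
}

// Hypothetical helper: counts non-overlapping occurrences of a literal tag.
function countTag(html: string, tag: string): number {
  return html.split(tag).length - 1;
}

// Cleanly nested wrappers: one <div> and one </div> survive.
const nested = [
  "<div>",
  "  <div>",
  "    <div>",
  "      <p>hi</p>",
  "    </div>",
  "  </div>",
  "</div>",
].join("\n");
const collapsed = collapseDivRuns(nested);

// Sibling divs inside a wrapper: only the outer opening tag is removed,
// so the result has two <div> tags but three </div> tags.
const siblings = [
  "<div>",
  "<div>a</div>",
  "<div>b</div>",
  "</div>",
].join("\n");
const broken = collapseDivRuns(siblings);
```

The second case fails because the regex only sees two opening tags on consecutive lines; it has no idea which closing tag belongs to which opening tag.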

Tim Pietzcker
A: 

In my experience, parsing complex HTML with regexes alone is hell; the regexes quickly get out of hand. It is much more robust to extract the pieces of information you need (maybe with simple regexes) and assemble them back into a simpler document.
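
A minimal sketch of that extract-and-reassemble idea, written in TypeScript rather than AS3. It assumes the only content worth keeping is the <p> blocks, as in the question's sample; the function name `rewrapParagraphs` is hypothetical.

```typescript
// Hypothetical helper: keeps only the <p>...</p> blocks and wraps them in one <div>.
function rewrapParagraphs(html: string): string {
  // Non-greedy match from each opening <p ...> tag to the nearest </p>.
  // \b keeps the pattern from matching other tags that start with "p", such as <pre>.
  const paragraphs = html.match(/<p\b[^>]*>[\s\S]*?<\/p>/g) || [];
  return "<div>\n" + paragraphs.join("\n") + "\n</div>";
}

// Usage: the nested wrappers are discarded, the paragraph is kept verbatim.
const rewrapped = rewrapParagraphs(
  '<div><div><p style="x">a <span>b</span></p></div></div>'
);
```

This sidesteps div matching entirely, at the cost of dropping anything that is not inside a <p>.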

alxx
+1  A: 

Assuming that your target HTML is actually valid XML, you can use a recursive function to drag out the non-div bits.

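// Returns the children of xml, with every <div> child replaced by its own
// (recursively unwrapped) contents.
static function grabNonDivContents(xml:XML):XMLList {
    var out:XMLList = new XMLList();
    var kids:XMLList = xml.children();
    for each (var kid:XML in kids) {
        // name() is null for text nodes, so check it before comparing.
        if (kid.name() && kid.name() == "div") {
            // Unwrap the div: keep its non-div contents, drop the div itself.
            var grandkids:XMLList = grabNonDivContents(kid);
            for each (var grandkid:XML in grandkids) {
                out += grandkid;
            }
        } else {
            out += kid;
        }
    }
    return out;
}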
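
Note that this returns the contents with every div removed, while the question wants one div kept around the result. Here is a rough sketch of the same recursion plus that final wrapping step, in TypeScript over a made-up minimal node type (`HtmlNode` and `collapseDivs` are hypothetical, not AS3 or DOM APIs).

```typescript
// Hypothetical minimal tree node; name is null for text nodes.
interface HtmlNode {
  name: string | null;
  text?: string;
  children: HtmlNode[];
}

// Same recursion as above: div children are replaced by their unwrapped contents.
function grabNonDivContents(node: HtmlNode): HtmlNode[] {
  let out: HtmlNode[] = [];
  for (const kid of node.children) {
    if (kid.name === "div") {
      out = out.concat(grabNonDivContents(kid));
    } else {
      out.push(kid);
    }
  }
  return out;
}

// Wraps the flattened contents in a single new div.
function collapseDivs(root: HtmlNode): HtmlNode {
  return { name: "div", children: grabNonDivContents(root) };
}

// Usage: div > div > div > p collapses to div > p.
const p: HtmlNode = { name: "p", children: [{ name: null, text: "20% OFF.", children: [] }] };
const tree: HtmlNode = {
  name: "div",
  children: [{ name: "div", children: [{ name: "div", children: [p] }] }],
};
const flat = collapseDivs(tree);
```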
SomeJerk
Works perfectly, thanks! In this case I'm always sure the XML is well formed and I have absolute control over it, so this XML solution is just perfect.
fast-dev