=========================================================================

EDIT: I'm using Node.js, so I don't have access to the DOM, and a full HTML parser is not an option (the overhead isn't justified for such a small amount of text).

=========================================================================

First off, I know: HTML + Regex = fail. However, I just need the function below to remove all tags that have attributes.

Here's what I have so far:

    exports.strip_tags = function(input, allowed) {
      // Strips HTML and PHP tags from a string, keeping any tags listed in `allowed`
      // Normalize `allowed` to a string of lowercase tags, e.g. "<b><i>"
      allowed = (((allowed || "") + "")
        .toLowerCase()
        .match(/<[a-z][a-z0-9]*>/g) || [])
        .join('');
      // [^>]* consumes everything (attributes included) up to the closing ">"
      var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
          commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
      return input.replace(commentsAndPhpTags, '').replace(tags, function($0, $1){
        return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
    };

Any chance someone knows how to change one of these regexes so that the function removes what I need it to?

To clarify: This function should remove all tags with attributes, keep only the tags that are allowed (without attributes), and output the result.
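
For illustration, that requirement could be sketched roughly as follows. This is a minimal sketch, not a robust solution: `stripTagsWithAttributes` is a hypothetical name, and it assumes that a lone trailing `/` (as in `<br/>`) does not count as an attribute and that attribute values never contain a `>` character.

```javascript
// Hypothetical helper (illustrative name): removes every tag that is not in
// `allowed`, and also removes allowed tags that carry attributes.
function stripTagsWithAttributes(input, allowed) {
  // Normalize the allowed list to a lowercase string such as "<b><br>".
  var allowedTags = (String(allowed || '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join('');
  // Group 1: tag name. Group 2: everything between the name and the closing ">".
  var tags = /<\/?([a-z][a-z0-9]*)\b([^>]*)>/gi;
  var commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input
    .replace(commentsAndPhpTags, '')
    .replace(tags, function (match, name, rest) {
      var isAllowed = allowedTags.indexOf('<' + name.toLowerCase() + '>') > -1;
      // Assumption: slashes alone (self-closing syntax) are not attributes.
      var hasAttributes = rest.replace(/\//g, '').trim() !== '';
      return isAllowed && !hasAttributes ? match : '';
    });
}

console.log(stripTagsWithAttributes(
  '<p class="x">Hi</p> <b>bold</b> <i>it</i><br/>', '<b><br>'));
// prints: Hi <b>bold</b> it<br/>
```

Note that dropping an opening tag because it has attributes leaves its closing tag behind (`<b style="a">x</b>` becomes `x</b>` when `<b>` is allowed), so a real version would need to decide how to handle that case.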

A: 

Convert it to XHTML and then use xpath.

HTML->XHTML tools:

As you said: HTML + Regex = fail.

Abe Miessler