Does anyone have a good way of finding out whether a string contains malformed XHTML using JavaScript?
My page accepts user-generated XHTML (the users can be trusted) and injects it into the DOM, so I want a way to check for unclosed or over-closed tags and encode their angle brackets as &lt; and &gt; so that the errors are simply displayed as text. This way all valid XHTML will still be rendered, and the invalid parts will become text nodes, allowing the script to at least continue despite the errors.
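To illustrate the behaviour I'm after, here is a minimal sketch that uses a single pass with a stack of open tags. The names escapeAngles and escapeUnbalancedTags are made up for this example, tag names are compared case-insensitively as an assumption, and it does not handle comments, CDATA sections, or attribute values that contain > characters.

```javascript
// Hypothetical helper: encode raw angle brackets so they render as text.
function escapeAngles(text) {
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical sketch: keep tags that pair up (or self-close), escape the rest.
function escapeUnbalancedTags(html) {
  var tagPattern = /<(\/?)([A-Za-z][\w:-]*)([^<>]*)>/g;
  var out = [];       // output pieces, joined at the end
  var openTags = [];  // stack of opening tags still waiting for a close
  var last = 0;
  var m;

  while ((m = tagPattern.exec(html)) !== null) {
    // Text between tags: any stray < or > in it is encoded.
    out.push(escapeAngles(html.slice(last, m.index)));
    last = tagPattern.lastIndex;

    var isClosing = m[1] === '/';
    var name = m[2].toLowerCase();
    var isSelfClosed = !isClosing && /\/\s*$/.test(m[3]);

    if (isSelfClosed) {
      out.push(m[0]);
    } else if (!isClosing) {
      // Emit the opening tag escaped for now; restore it if a close turns up.
      openTags.push({ name: name, index: out.length, raw: m[0] });
      out.push(escapeAngles(m[0]));
    } else {
      // Look down the stack for the nearest opening tag with the same name.
      var i = openTags.length - 1;
      while (i >= 0 && openTags[i].name !== name) { i--; }
      if (i >= 0) {
        out[openTags[i].index] = openTags[i].raw;
        out.push(m[0]);
        openTags.length = i; // tags opened after it stay escaped
      } else {
        out.push(escapeAngles(m[0])); // closing tag with no opener
      }
    }
  }
  out.push(escapeAngles(html.slice(last)));
  return out.join('');
}
```

With this sketch, escapeUnbalancedTags('<p>ok</p><b>oops') would return '<p>ok</p>&lt;b&gt;oops', so the paragraph is kept as markup and the unclosed b tag shows up as text.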
Here's a similar method I wrote, which is rather crude. It has a different purpose: it strips all valid XHTML tags and leaves the rest. It works by recursively selecting the innermost tags and stripping them out.
    stripHTML: function(html) {
        // Crude match for <tag ...>content</tag>; \1 requires the closing
        // name to equal the opening one. Note that . does not cross newlines.
        var validXHTML = /<(\S+).*>(.*?)<\/\1>/i;
        // Self-closing tags such as <br /> or <img ... />.
        var validSelfClose = /<(input|img|br|hr)[^>]*\/>/gi;
        html = html.replace(validSelfClose, '');
        if (validXHTML.test(html)) {
            var loc = html.search(validXHTML);
            var str = html.match(validXHTML);
            // Replace the matched element with its inner content ($2).
            html = html.substr(0, loc) +
                strings.addPunctuation(html.substr(loc, str[0].length).replace(validXHTML, '$2')) +
                html.substr(loc + str[0].length);
            // Recurse while matched pairs remain.
            if (validXHTML.test(html)) {
                html = strings.stripHTML(html);
            }
        }
        return html;
    }
Feel free to improve the above, or answer the actual question.
Update
My idea for a simple way to at least accommodate most cases is this:
encode all > and < that do not open or close anything,
change all tag names inside < > to lowercase,
working recursively, start with the innermost tags and change them from lowercase to uppercase, so <li>something</li> becomes <LI>something</LI>,
after the recursion finishes, strip out all remaining > and <,
switch all uppercase tags back to lowercase.
Does anyone foresee any immediate problems with this, other than the fact that it will take a fair amount of time to run?
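To make the idea concrete, here is a rough sketch of the steps listed above. The function name markAndEscape is made up, the regexes assume attribute values never contain < or >, and the leftover tags are encoded as &lt; and &gt; rather than deleted, which is my assumption based on the goal described at the top.

```javascript
// Hypothetical sketch of the mark-as-uppercase idea; not production code.
function markAndEscape(html) {
  // Encode stray < and > and lowercase every tag name in one pass.
  html = html.replace(/<(\/?)([A-Za-z][\w:-]*)([^<>]*)>|[<>]/g,
    function (m, slash, name, rest) {
      if (!name) { return m === '<' ? '&lt;' : '&gt;'; }
      return '<' + slash + name.toLowerCase() + rest + '>';
    });

  // Self-closing tags are valid on their own, so mark them first.
  html = html.replace(/<([a-z][\w:-]*)((?:\s[^<>]*)?)\/>/g,
    function (m, name, attrs) {
      return '<' + name.toUpperCase() + attrs + '/>';
    });

  // Innermost pair: the content may hold marked (uppercase) tags but no
  // lowercase ones. Repeat until a pass changes nothing.
  var pair = /<([a-z][\w:-]*)((?:\s[^<>]*)?)>((?:(?!<\/?[a-z])[\s\S])*?)<\/\1>/g;
  var previous;
  do {
    previous = html;
    html = html.replace(pair, function (m, name, attrs, content) {
      var upper = name.toUpperCase();
      return '<' + upper + attrs + '>' + content + '</' + upper + '>';
    });
  } while (html !== previous);

  // Anything still lowercase never found a partner: show it as text.
  html = html.replace(/<(\/?[a-z][^<>]*)>/g, '&lt;$1&gt;');

  // Switch the marked tags back to lowercase.
  return html.replace(/<(\/?)([A-Z][\w:-]*)/g, function (m, slash, name) {
    return '<' + slash + name.toLowerCase();
  });
}
```

One thing the sketch makes visible: an unclosed inner tag also prevents every enclosing pair from being marked. For example, markAndEscape('<ul><li>a</li><b>x</ul>') keeps the li pair but encodes both ul tags along with the stray b.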