Does anyone have a good way of finding out whether a string contains malformed XHTML using JavaScript?

Since my page allows 'user' generated XHTML returns (the users can be trusted) and injects them into the DOM, I want a way to check whether there are unclosed or overly closed tags, and encode those as &lt; and &gt; so that the errors are simply displayed as text. This way all valid XHTML will still be rendered, and the invalid parts will become text nodes, allowing the script to at least continue despite the errors.


Here's a similar method I made, which is rather crude. It has a different purpose: simply stripping all valid XHTML tags and leaving the rest. It works by recursively selecting the innermost tags and stripping them out.

stripHTML: function(html) {
   // A matched pair: <tag ...>content</tag> (crude; the name is back-referenced as \1)
   var validXHTML = /<(\S+).*>(.*?)<\/\1>/i;
   // Self-closing elements such as <br /> or <img ... />
   var validSelfClose = /<(input|img|br|hr)[^>]*\/>/gi;

   html = html.replace(validSelfClose, '');

   if (validXHTML.test(html)) {
      var loc = html.search(validXHTML);
      var str = html.match(validXHTML);
      // Replace the matched element with its inner text and stitch the string back together
      html = html.substr(0, loc) +
         strings.addPunctuation(html.substr(loc, str[0].length).replace(validXHTML, '$2')) +
         html.substr(loc + str[0].length);

      // Recurse while matched pairs remain
      if (validXHTML.test(html)) {
         html = strings.stripHTML(html);
      }
   }
   return html;
}


Feel free to improve the above, or answer the actual question.


Update

My idea for a simple way to at least accommodate most cases is this:

Encode all > and < that neither open nor close anything.

Change all tag names inside < > to lowercase.

Working recursively, start with the innermost tags and change them from lowercase to uppercase, so <li>something</li> becomes <LI>something</LI>.

After the recursion finishes, strip out all other > and <.

Switch all uppercase tags back to lowercase.

Are any problems immediately foreseen, other than the fact it will take a fair amount of time?
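
For comparison, here is a minimal sketch of the same goal done in a single pass with a stack instead of recursive case-flipping. It is not the method described above, only one possible variant. The function name `escapeUnbalancedTags` is hypothetical, and the sketch assumes attribute values never contain `<` or `>` and that comments, CDATA sections, and entities need no special handling.

```javascript
// Hypothetical helper: keeps tags that have a matching partner and
// encodes every other < and > so they show up as plain text.
function escapeUnbalancedTags(html) {
  // Assumption: a tag is <name ...> or </name>, with no < or > inside attributes.
  var tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>/g;
  var pieces = [];  // text and tag pieces in document order
  var open = [];    // stack of opening-tag pieces still waiting for a close
  var last = 0;
  var m;

  function escapeAngles(s) {
    return s.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  }

  while ((m = tagPattern.exec(html)) !== null) {
    // Text between the previous tag and this one is never kept verbatim.
    pieces.push({ text: html.slice(last, m.index), keep: false });
    last = tagPattern.lastIndex;

    var isClose = m[1] === '/';
    var name = m[2].toLowerCase();
    var selfClosing = !isClose && /\/\s*$/.test(m[3]);
    var piece = { text: m[0], keep: false, name: name };
    pieces.push(piece);

    if (selfClosing) {
      piece.keep = true;
    } else if (!isClose) {
      open.push(piece);
    } else {
      // Look for the nearest open tag with the same name.
      var i = open.length - 1;
      while (i >= 0 && open[i].name !== name) {
        i--;
      }
      if (i >= 0) {
        open[i].keep = true;
        piece.keep = true;
        // Anything opened after it was never closed; it stays unkept.
        open.length = i;
      }
    }
  }
  pieces.push({ text: html.slice(last), keep: false });

  return pieces.map(function (p) {
    return p.keep ? p.text : escapeAngles(p.text);
  }).join('');
}
```

With this sketch, a call such as `escapeUnbalancedTags('<ul><li>one</li><li>two</ul>')` leaves the matched `<ul>` and first `<li>` pair intact and encodes only the unclosed second `<li>`. Because it walks the string once, the cost grows with the input length rather than with nesting depth.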

A: 

I do this on the server with HTMLTidy

htmltidy -asxhtml
SpliFF
Seems like a good project, but I can't use this as part of an xhtml page.
Ian Elliott
But since you've already said in another comment that you're using XHR, can't you just post the malformed (X)HTML to your own tidy.cgi? htmltidy can fix almost anything, and what it can't fix your script probably wouldn't handle much better. Sure, it adds maybe 2 seconds to the submit/save action, but is that really going to be an issue?
SpliFF
btw, there is a project called jTidy which can probably run "on site" as a Java applet but I don't think the project is actively maintained and I haven't used it.
SpliFF
Well, it's part of an XHTML+Voice application, which already calls a CGI script to interpret a Haskell program. The entire return has to be 'verified' before I allow it to be parsed, which has to be done with a synchronous Ajax request, or else the VXML initiates before it has anything to output. Adding two seconds to this will hang the browser for an additional two seconds, not good! Also, I must contain this within one file :( The constraints weren't mine to make, sadly.
Ian Elliott
2 seconds is pure speculation and was meant to describe the HTTP round-trip cost (since tidy should finish in milliseconds). If you're already round-tripping the data then you can add a tidy stage to the output/verification process. In reality it could end up being quicker than a pure JS approach, especially if you're relying on regex. Test it and see.
SpliFF
# time tidy some.html
real 0m0.006s
user 0m0.000s
sys 0m0.010s
SpliFF
That's 6 ms (real time) to convert and output 30 lines of HTML as XHTML.
SpliFF
A: 

So is the HTML generation also happening on client side? Best is to validate the generated markup at the source itself.

If not, perhaps there is a way to program the W3C validator.

http://validator.w3.org/#validate_by_input

also see, http://www.w3.org/QA/Tools/

Sesh
The HTML is generated after load, by grabbing it from a script file using xmlhttp. I have to be able to validate it on site; sending it off to the W3C and waiting for a response won't do. Not to mention that I just need to validate the tags, not the document, as the W3C validator will always report an input consisting only of tags as invalid.
Ian Elliott