I basically agree with Sky Sanders' answer, except for this:
You would be expecting the simple tidy parser to infer prior intent.
You could write a parser that provides the described functionality without having to infer any intent, working purely deterministically. One could easily (yeah, more or less easily :) ) write an algorithm that does the job. The idea would be:
Adding closing tags
After all, HTML Tidy can already do this, and every browser/parser does it implicitly (I'm not talking about valid XHTML here):
<div>some <span><em>text</span> here</div>
becomes
<div>some <span><em>text</em></span> here</div>
Adding opening tags
We could now write an algorithm that analyses the following, starting at the end of the string and searching backwards:
<div>some <span>text</em></span> here</div>
to produce the following one, because it sees that the em tag is embedded in the span tag.
<div>some <span><em>text</em></span> here</div>
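The reverse search can be sketched as the mirror image of the usual stack approach: walk the tokens from the end, remember closing tags that still wait for an opener, and when an opening tag is reached, insert openers for everything pending inside it. This is only an illustration under the same simplifying assumptions as any regex-based tokenizer; `TOKEN`, `TAG` and `add_opening_tags` are hypothetical names.

```python
import re

# Splits a fragment into text and tag tokens (the capture group keeps the tags).
TOKEN = re.compile(r"(</?[a-zA-Z][a-zA-Z0-9]*[^>]*>)")
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def add_opening_tags(html):
    out = []      # tokens collected back to front, reversed at the end
    pending = []  # closing tags seen so far that still wait for an opener
    for token in reversed(TOKEN.split(html)):
        m = TAG.fullmatch(token)
        if m:
            is_closing, name = m.group(1) == "/", m.group(2).lower()
            if is_closing:
                pending.append(name)
            elif name in pending:
                # Everything still pending after `name` sits inside this element,
                # so its opener goes right after this opening tag.
                while pending[-1] != name:
                    out.append("<%s>" % pending.pop())
                pending.pop()
        out.append(token)
    # Closing tags that were never enclosed get their opener at the very start.
    while pending:
        out.append("<%s>" % pending.pop())
    return "".join(reversed(out))


print(add_opening_tags("<div>some <span>text</em></span> here</div>"))
# → <div>some <span><em>text</em></span> here</div>
```

Note that the inserted `<em>` lands directly after `<span>` because that is the first enclosing opening tag the backward scan meets; the sketch has no way of knowing where the emphasis was really meant to start.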
Combining these two
Now we have to write an algorithm that adds both missing closing tags and missing opening tags. Let's take this HTML fragment:
<div>some <span>text</em> here</div>
First apply the 'add all missing closing tags' method:
<div>some <span>text</em> here</span></div>
The algorithm assumes here that every closing and opening tag that comes after <span> is embedded in the span. It only stops when it sees a closing tag for some opening tag that came before the <span>. In this case that is </div>, which had a valid opening tag <div> before it. Then apply the same semantics in a reverse search, as described before:
<div>some <span><em>text</em> here</span></div>
Et voilà.
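Since the two passes are mirror images of each other, they can be sketched as one helper that is run once forwards and once backwards. This is a rough illustration, not a robust implementation: the names `balance` and `repair` are invented here, the tokenizer is a plain regex, and the void-element list is an assumed subset.

```python
import re

# Splits a fragment into text and tag tokens (the capture group keeps the tags).
TOKEN = re.compile(r"(</?[a-zA-Z][a-zA-Z0-9]*[^>]*>)")
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Assumption: a small subset of void elements that never take a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


def balance(tokens, opener_is_closing):
    """One pass over the tokens that inserts the partner of every unmatched 'opener'.

    Forwards, openers are opening tags and the partner is a closing tag.
    Backwards, openers are closing tags and the partner is an opening tag.
    """
    fmt = "<%s>" if opener_is_closing else "</%s>"
    out, stack = [], []
    for token in tokens:
        m = TAG.fullmatch(token)
        if m and m.group(2).lower() not in VOID:
            is_closing, name = m.group(1) == "/", m.group(2).lower()
            if is_closing == opener_is_closing:
                stack.append(name)
            elif name in stack:
                # Insert partners for everything nested inside `name`.
                while stack[-1] != name:
                    out.append(fmt % stack.pop())
                stack.pop()
        out.append(token)
    while stack:
        out.append(fmt % stack.pop())
    return out


def repair(html):
    tokens = TOKEN.split(html)
    tokens = balance(tokens, opener_is_closing=False)       # add missing closing tags
    tokens = balance(tokens[::-1], opener_is_closing=True)  # add missing opening tags
    return "".join(tokens[::-1])


print(repair("<div>some <span>text</em> here</div>"))
# → <div>some <span><em>text</em> here</span></div>
```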
Does that all make sense?
In my opinion: no. It is technically possible, but not worth the effort. You would have to implement your own parser, together with the pseudo-intelligent methods described above. Additionally, this would apply a semantic to HTML that isn't there in the first place: every browser/parser just ignores isolated closing tags, so why would you want to pay attention to them?
If I haven't convinced you yet, consider the semantics of HTML:
some <b>text</b> here
reads like: "print 'some'. start rendering bold. print 'text'. stop rendering bold. print 'here'."
While:
some text</b> here
reads like: "print 'some text'. stop rendering bold." "What? I didn't even start rendering anything bold!? I'll just ignore that..." :)
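This "reading" can be imitated with a toy narrator built on Python's standard `html.parser.HTMLParser`. It is only a sketch of the idea, not how a browser actually renders; the class `Narrator` and the function `narrate` are made up for this example.

```python
from html.parser import HTMLParser


class Narrator(HTMLParser):
    """Turns a fragment into a list of rendering 'instructions'."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.log = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.log.append("start %s" % tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the tag and anything that was opened inside it.
            while self.open_tags.pop() != tag:
                pass
            self.log.append("stop %s" % tag)
        else:
            # Nothing of that kind was ever started, so there is nothing to stop.
            self.log.append("ignore stray </%s>" % tag)

    def handle_data(self, data):
        self.log.append("print %r" % data)


def narrate(html):
    narrator = Narrator()
    narrator.feed(html)
    narrator.close()
    return narrator.log


print(narrate("some <b>text</b> here"))
print(narrate("some text</b> here"))
```

For the second fragment the log contains an "ignore stray </b>" entry in place of a "stop b" entry, which is exactly the shrug described above.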