Hi Guys,

Is it possible to configure HTML Tidy in the following way:

Given the html:

lorem ipsum</em> dolar sit amet.</p>

To have it generate

<p><em>lorem ipsum</em> dolar sit amet.</p>

Instead of just stripping out the closing tags?

Many thanks

Matt

+1  A: 

No. HTML Tidy does not provide that option.

You would be expecting the simple tidy parser to infer prior intent.

Determining when a tag must be closed, whether it was intended to be closed at that point or not, can be accomplished by the parser using the rules of html.

Sky Sanders
+1. What I wanted to say in a comment didn't fit into the 600 characters :), so I wrote another answer.
Philip Daubmeier
+2  A: 

I basically agree with Sky Sanders' answer, except for:

You would be expecting the simple tidy parser to infer prior intent.

You could write a parser that provides the described functionality without having to infer any intent, working purely deterministically. One could easily (yeah, more or less easily :) ) write an algorithm that does the job. The idea would be:

Adding closing tags

After all, this can be done with HTML Tidy already, and every browser/parser already does it implicitly (I'm not talking about valid XHTML here):

<div>some <span><em>text</span> here</div>

gets

<div>some <span><em>text</em></span> here</div>
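A minimal sketch of this step in Python, just to make the idea concrete. It is not how HTML Tidy does it internally; the names `TAG_RE` and `add_closing_tags` are made up for illustration, and it assumes simple tags only (no void elements like `<br>`, no comments, no `>` inside attribute values):

```python
import re

# Matches an opening or closing tag; captures the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def add_closing_tags(html):
    out = []     # pieces of the repaired document
    stack = []   # names of the elements that are open right now
    pos = 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # text before this tag
        pos = m.end()
        name = m.group(2).lower()
        if m.group(1) == "":
            # Opening tag: remember it.
            stack.append(name)
        elif name in stack:
            # Closing tag with a partner: first close everything
            # that was opened after that partner.
            while stack[-1] != name:
                out.append("</%s>" % stack.pop())
            stack.pop()
        # A closing tag without a partner falls through unchanged.
        out.append(m.group(0))
    out.append(html[pos:])
    # Close whatever is still open at the end of the input.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


print(add_closing_tags("<div>some <span><em>text</span> here</div>"))
```

Note that a stray closing tag is neither dropped nor repaired here, it is simply passed through.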

Adding opening tags

We could now write an algorithm that analyses the following, starting at the end of the string and searching backwards:

<div>some <span>text</em></span> here</div>

to produce the following, because it sees that the closing em tag sits inside the span element.

<div>some <span><em>text</em></span> here</div>
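The reverse search could look roughly like the following Python sketch. Again the names (`TAG_RE`, `add_opening_tags`) are hypothetical, and the same simplifying assumptions apply (simple tags only, no void elements or comments). The rule it uses is one deterministic choice among several: a missing opening tag is placed directly after the nearest opening tag that encloses it, or at the very start if there is none.

```python
import re

# Matches an opening or closing tag; captures the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def add_opening_tags(html):
    out = []     # pieces of the repaired document, collected back to front
    stack = []   # closing tags still waiting for their opening tag
    pos = len(html)
    for m in reversed(list(TAG_RE.finditer(html))):
        out.append(html[m.end():pos])  # text after this tag
        pos = m.start()
        name = m.group(2).lower()
        if m.group(1) == "/":
            # Closing tag: remember it.
            stack.append(name)
        elif name in stack:
            # Opening tag with a partner: every closing tag seen after
            # that partner gets its opening tag right behind this one.
            while stack[-1] != name:
                out.append("<%s>" % stack.pop())
            stack.pop()
        out.append(m.group(0))
    out.append(html[:pos])
    # Closing tags that never found a partner are opened at the very start.
    while stack:
        out.append("<%s>" % stack.pop())
    return "".join(reversed(out))


print(add_opening_tags("<div>some <span>text</em></span> here</div>"))
```

Fed with the fragment from the question, `lorem ipsum</em> dolar sit amet.</p>`, this sketch puts both `<p>` and `<em>` at the start of the string.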

Combining these two

Now we have to write an algorithm that adds both missing closing and missing opening tags. Let's take this HTML fragment:

<div>some <span>text</em> here</div>

First apply the 'add all missing closing tags' method:

<div>some <span>text</em> here</span></div>

The algorithm assumes here that every opening and closing tag that comes after <span> is nested inside the span. It only stops when it sees a closing tag whose opening tag came before the <span>. In this case that is </div>, which has a valid opening tag <div> before it. Then apply the same logic in a reverse search, as described before:

<div>some <span><em>text</em> here</span></div>

et voila.
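Both passes are really the same stack walk, once forwards and once backwards, so the combination can be sketched as one helper that runs in either direction. This is only an illustration under the same assumptions as before (simple tags, no void elements, no comments); `TOKEN_RE`, `NAME_RE`, `balance` and `repair` are names I made up:

```python
import re

TOKEN_RE = re.compile(r"(<[^>]+>)")                  # splits into text and tags
NAME_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)")   # extracts the tag name


def balance(tokens, reverse):
    """Forward: add missing closing tags. Reverse: add missing opening tags."""
    template = "<%s>" if reverse else "</%s>"
    out, stack = [], []
    for tok in (reversed(tokens) if reverse else tokens):
        m = NAME_RE.match(tok)
        if m:
            name = m.group(1).lower()
            is_close = tok.startswith("</")
            if is_close == reverse:
                # Forward pass remembers opening tags,
                # reverse pass remembers closing tags.
                stack.append(name)
            elif name in stack:
                # Partner found: supply the missing counterparts of
                # everything remembered after it.
                while stack[-1] != name:
                    out.append(template % stack.pop())
                stack.pop()
        out.append(tok)
    while stack:
        out.append(template % stack.pop())
    return out[::-1] if reverse else out


def repair(html):
    tokens = TOKEN_RE.split(html)
    tokens = balance(tokens, reverse=False)  # step 1: add closing tags
    tokens = balance(tokens, reverse=True)   # step 2: add opening tags
    return "".join(tokens)


print(repair("<div>some <span>text</em> here</div>"))
print(repair("lorem ipsum</em> dolar sit amet.</p>"))
```

The first call reproduces the two steps shown above, the second one handles the fragment from the question.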

Does that all make sense?

In my opinion: No. It is technically possible, but not worth the effort. You would have to implement your own parser, together with the pseudo-intelligent methods described above. Additionally, this would impose a semantics on HTML that isn't there: every browser/parser just ignores isolated closing tags, so why would you want to pay attention to them?

If I haven't convinced you yet, consider the semantics of HTML:

some <b>text</b> here reads like: "print 'some'. start rendering bold. print 'text'. stop rendering bold. print 'here'."

While:

some text</b> here reads like: "print 'some text'. stop rendering bold." "What? I didn't even start rendering anything bold!? I'll just ignore that..." :)
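That way of reading can be written down as a toy interpreter. This is only a sketch of the mental model, not of any real browser; `narrate` is a hypothetical name and it only knows about `<b>`:

```python
import re


def narrate(html):
    steps = []
    depth = 0  # how many <b> elements are currently open
    for tok in re.split(r"(</?b>)", html):
        if tok == "<b>":
            depth += 1
            steps.append("start bold")
        elif tok == "</b>":
            if depth:
                depth -= 1
                steps.append("stop bold")
            else:
                # Nothing was started, so there is nothing to stop.
                steps.append("ignore stray </b>")
        elif tok.strip():
            steps.append("print %r" % tok.strip())
    return steps


print(narrate("some <b>text</b> here"))
print(narrate("some text</b> here"))
```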

Philip Daubmeier