I have been using PHP's DOM to load an HTML template, modify it, and output it. Recently I discovered that self-closing (empty) tags in the output don't include a closing slash, even though the ones in the template file did.

e.g.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
</body>
</html>

becomes:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
</body>
</html>

Is this a bug or a setting, or a doctype issue?

+2  A: 

It's a doctype issue: as the document is text/html, the closing slash isn't needed. You only need the closing slash if it is an XHTML doc.

Noted you've updated the question to add in the doctype, but PHP's DOM also looks at that meta tag you've got in there, and content="text/html; charset=utf-8" clearly isn't XML-based, it's just text/html :)

Aside: the DOM API also picks up the charset from there.

nathan
I still don't understand why people use an XHTML doctype, especially when they then use a content-type of text/html to get their site working properly in IE... for 99% of the web, XHTML doesn't offer any advantage over HTML 4.01, at the cost of having to implement it improperly (viz. content-type=text/html)
HorusKol
XHTML is compatible with the XML tool chain, and there has been a huge investment in XML tooling. It may not make a difference to browsers, but it sure makes a difference to many other clients and generators (especially if you add XSLT etc. into the mix)
nathan
+3  A: 

DOMDocument->saveHTML() takes your XML DOM infoset and writes it out as old-school HTML, not XML. You should not use saveHTML() together with an XHTML doctype, as its output won't be well-formed XML.

If you use saveXML() instead, you'll get proper XHTML. It's fine to serve this XML output to standards-compliant browsers if you give it a Content-Type: application/xhtml+xml header. But unfortunately IE6-8 won't be able to read that, as they can still only handle old-school HTML, under the text/html media type.
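To see the difference side by side, here is a minimal sketch. The markup string and variable names are made up for illustration, and it deliberately leaves out the XHTML doctype, so the exact whitespace before the slash may differ from what you get with your template.

```php
<?php
// Load a small HTML fragment; the markup is a made-up example.
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>t</title></head><body><p>a<br>b</p></body></html>');

// HTML serialisation: empty elements are written without a closing slash.
$asHtml = $doc->saveHTML();

// XML serialisation: empty elements are written with self-closing syntax.
$asXml = $doc->saveXML();

echo $asHtml, "\n", $asXml, "\n";
```

The same DOM tree is serialised twice; only the choice of saveHTML() or saveXML() changes how the empty `<br>` is written.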

The usual compromise solution is to serve text/html and use ‘HTML-compatible XHTML’ as outlined in Appendix C of the XHTML 1.0 spec. But sadly there is no PHP DOMDocument->saveXHTML() method to generate the correct output for this.

There are some things you can do to persuade saveXML() to produce HTML-compatible output for some common cases. The main one is that you have to ensure that only elements defined by HTML4 as having an EMPTY content model (<img>, <br> etc) actually do have empty content, causing the self-closing syntax (<img/>) to be used. Other elements must not use the self-closing syntax, so if they're empty you should put a space in their text content to stop them being so:

<script src="x.js"/>           <-- no good, confuses HTML parser and breaks page
<script src="x.js"> </script>  <-- fine
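One way to apply that padding automatically, as a rough sketch: walk the tree before calling saveXML() and put a single space into every childless element that HTML4 does not define as EMPTY. The function name `padEmptyElements` and the constant `HTML4_EMPTY` are my own inventions, not part of PHP's DOM, and the sample document has no doctype or namespace.

```php
<?php
// Elements that HTML 4 defines with an EMPTY content model.
const HTML4_EMPTY = [
    'area', 'base', 'basefont', 'br', 'col', 'frame', 'hr',
    'img', 'input', 'isindex', 'link', 'meta', 'param',
];

// Hypothetical helper: give every other childless element a one-space
// text node so the XML serialiser cannot collapse it to <tag/>.
function padEmptyElements(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    // Select elements with no child nodes at all.
    foreach (iterator_to_array($xpath->query('//*[not(node())]')) as $el) {
        if (!in_array(strtolower($el->localName), HTML4_EMPTY, true)) {
            $el->appendChild($doc->createTextNode(' '));
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<html><head><script src="x.js"/></head><body><br/></body></html>');
padEmptyElements($doc);

// Serialise just the root element, skipping the XML declaration.
$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
```

The `<script>` element gains a space and keeps its end tag, while `<br>` is left alone and stays self-closing.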

The other one to look out for is handling of the inline <script> and <style> elements, which are normal elements in XHTML but special CDATA-content elements in HTML. Some /*<![CDATA[*/.../*]]>*/ wrapping is required to make any < or & characters inside them behave mostly-consistently, though note you still have to avoid the ]]> and </ sequences.
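A sketch of that wrapping done through the DOM rather than by string hacking. `wrapInlineCode` is a hypothetical helper name, and the sketch assumes the inline code contains neither `]]>` nor `</`, which you would still have to check for yourself.

```php
<?php
// Hypothetical helper: rewrite the content of inline <script> and <style>
// elements as /*<![CDATA[*/ ... /*]]>*/ so that < and & are not escaped.
function wrapInlineCode(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    $query = '//script[not(@src)] | //style';
    foreach (iterator_to_array($xpath->query($query)) as $el) {
        $code = $el->textContent;
        if (trim($code) === '') {
            continue; // nothing to protect
        }
        // Remove the existing children, then rebuild the content.
        while ($el->firstChild) {
            $el->removeChild($el->firstChild);
        }
        // The comment markers hide the CDATA delimiters from the JS/CSS parser.
        $el->appendChild($doc->createTextNode('/*'));
        $el->appendChild($doc->createCDATASection('*/' . $code . '/*'));
        $el->appendChild($doc->createTextNode('*/'));
    }
}

$doc = new DOMDocument();
$doc->loadXML('<html><head><script>if (a &lt; b &amp;&amp; c) { run(); }</script></head><body/></html>');
wrapInlineCode($doc);

$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
```

An XML parser sees a CDATA section; an HTML parser sees script text that happens to start and end with a comment.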

If you want to really do it properly you would have to write your own HTML-compatible-XHTML serialiser. Long-term that would probably be a better option. But for small simple cases, hacking your input so that it doesn't contain anything that would come out the other end of an XML serialiser as incompatible with HTML is probably the quick solution.
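To give an idea of the shape such a serialiser might take, here is a deliberately incomplete sketch. The names `serialiseXhtml` and `VOID_ELEMENTS` are invented for this example, and it ignores comments, doctypes, namespace declarations and the special handling that inline `<script>` and `<style>` content needs.

```php
<?php
// Elements that HTML 4 defines with an EMPTY content model.
const VOID_ELEMENTS = [
    'area', 'base', 'basefont', 'br', 'col', 'frame', 'hr',
    'img', 'input', 'isindex', 'link', 'meta', 'param',
];

// Hypothetical serialiser: only void elements get the self-closing
// syntax; every other element always gets an explicit end tag.
function serialiseXhtml(DOMNode $node): string
{
    switch ($node->nodeType) {
        case XML_TEXT_NODE:
        case XML_CDATA_SECTION_NODE:
            return htmlspecialchars($node->nodeValue, ENT_NOQUOTES, 'UTF-8');

        case XML_ELEMENT_NODE:
            $out = '<' . $node->nodeName;
            foreach ($node->attributes as $attr) {
                $value = htmlspecialchars($attr->value, ENT_COMPAT, 'UTF-8');
                $out .= ' ' . $attr->nodeName . '="' . $value . '"';
            }
            if (in_array($node->nodeName, VOID_ELEMENTS, true)) {
                return $out . ' />'; // space before the slash, per Appendix C
            }
            $out .= '>';
            foreach ($node->childNodes as $child) {
                $out .= serialiseXhtml($child);
            }
            return $out . '</' . $node->nodeName . '>';

        default:
            return ''; // other node types are dropped in this sketch
    }
}

$doc = new DOMDocument();
$doc->loadXML('<html><head><script src="x.js"/></head><body><p class="a&amp;b">1 &lt; 2<br/></p></body></html>');

$out = serialiseXhtml($doc->documentElement);
echo $out, "\n";
```

Because the end-tag decision is made per element name, there is no need to pad empty elements in the input first.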

That or just suck it up and live with old-school non-XML HTML, obviously.

bobince
Thank you for the detailed reply. I have always hated PHP's DOM, however this is the icing on the coffin. I may try some simple regex pre/post processing to alter the input/output with saveXML(). This is not an ideal solution. Does PHP's DOM support HTML 5?
peterjwest
Avoid regex-hacking output HTML at all costs. (But I would say that, wouldn't I?) Writing an XHTML serialiser isn't that bad (XML is way easier to serialise than it is to parse); it'd be slow, but then preparing templates with `DOMDocument` is pretty slow in general. As for HTML5, it will effectively work the same as HTML4. PHP doesn't know about the new HTML5 elements, so if you used any that should be `EMPTY` (e.g. `<source>`) you'd get an invalid end-tag for them.
bobince
Oh wow, [you would](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) say that. Do you know a fast way to prepare templates (HTML or XHTML) in PHP?
peterjwest
PHP is a templating language, isn't it? :-) OK, it's not one without problems, in particular the way it doesn't default to HTML-encoding output, but you can at least write a shortcut function to save putting `echo htmlspecialchars` every time. There are dozens of alternative template systems for PHP, to various tastes.
bobince
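For illustration, the shortcut function mentioned in the last comment could look something like the sketch below. The name `h` is just a common convention and the sample value is made up.

```php
<?php
// Hypothetical shortcut: HTML-encode a value before it is output.
function h(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$name = 'Tom & Jerry <script>';
$html = '<p>' . h($name) . '</p>';
echo $html, "\n";
```

In a template file the same call would typically appear as `<?= h($name) ?>`.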