I've got some XML (valid XHTML) that looks like this:

<html>
<head>
<script type="text/javascript"><![CDATA[
function change_header()
{
document.getElementById("myHeader").innerHTML="Nice day!";
}
]]></script>
</head>

<body>

<h1 id="myHeader">Hello World!</h1>
<button onclick="change_header()">Change text</button>

</body>
</html>

I'm trying to get the #myHeader node using document.GetElementById("myHeader"), but it always returns null. Why?

I'm guessing it doesn't recognize the id attribute as an ID without a DTD or something? If that's the case, how can I get it to use an HTML DTD?

+1  A: 

It's because XmlDocument knows nothing about what an id means: without a DTD, nothing tells it that the id attribute is of type ID. You need to include a DTD in your XHTML document. Just put the following at the beginning of your HTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Example:

string html = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html><body><div id=""foo"">some content</div></body></html>";
XmlDocument document = new XmlDocument();
document.LoadXml(html);
XmlElement div = document.GetElementById("foo");

Note that this may be slower, because the parser has to download the DTD from the URL given in the DOCTYPE.

Darin Dimitrov
The document is coming from the web in the form of a stream. Is there another way to set the doctype?
Mark
I'm afraid you will need to load it into memory, prepend the correct DOCTYPE, and then load it into an XmlDocument. Of course, if you intend to parse HTML, I would recommend using [Html Agility Pack](http://htmlagilitypack.codeplex.com/). Using XmlDocument to parse invalid web pages (ones without a DTD, for example) is a perilous task.
Darin Dimitrov
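For what it's worth, the "load it into memory and add the DOCTYPE" step could look roughly like the sketch below. `DoctypeHelper` and `ReadWithDoctype` are made-up names for illustration, and the code assumes the stream is UTF-8 and contains neither an XML declaration nor a DOCTYPE of its own (a declaration such as `<?xml ...?>` would have to stay in front of the DOCTYPE).

```csharp
using System;
using System.IO;
using System.Text;

static class DoctypeHelper
{
    // The XHTML 1.0 Transitional DOCTYPE quoted in the answer above.
    const string Doctype =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";

    // Hypothetical helper: reads the whole stream into memory and puts
    // the DOCTYPE in front of the markup.
    // Assumption: UTF-8 input with no XML declaration and no DOCTYPE.
    public static string ReadWithDoctype(Stream input)
    {
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return Doctype + reader.ReadToEnd();
        }
    }
}
```

The resulting string can then be passed to `document.LoadXml(...)` as in the answer. Whether `GetElementById` works afterwards still depends on the parser being able to fetch and process the DTD.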
Trying `SgmlReader` instead. Wasn't too fond of HtmlAgilityPack.
Mark
Yeah, it's also a good way.
Darin Dimitrov
I'm marking this as the accepted answer to close this thread, but I ended up searching for the "id" attribute directly instead, because it's easier.
Mark
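For anyone landing here later: one way to look for the "id" attribute directly, without any DTD, is an XPath query. This is only a sketch of that idea, not necessarily what was done in the end. `IdLookup` and `FindById` are made-up names, and the code assumes the id value contains no single quote, since it is pasted straight into the XPath expression.

```csharp
using System;
using System.Xml;

static class IdLookup
{
    // Hypothetical helper: returns the first element whose "id" attribute
    // equals the given value, or null if there is none.
    // Assumption: id contains no single quote (it is embedded in the XPath).
    public static XmlElement FindById(XmlDocument doc, string id)
    {
        // "//*" matches every element in the document, and "@id" is a plain
        // attribute test, so no DTD is needed for this to work.
        return doc.SelectSingleNode("//*[@id='" + id + "']") as XmlElement;
    }
}
```

With the document from the question loaded into an `XmlDocument`, `IdLookup.FindById(document, "myHeader")` returns the `h1` element. Unlike `GetElementById`, this walks the tree on every call and matches any attribute named `id`, whether or not a DTD declares it as an ID.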