I have a project where I am taking some particularly ugly "live" HTML and forcing it into a formal XML DOM with the HTML Agility Pack. I would then like to query over this with LINQ to XML so that I can scrape out the bits I need. I'm using the method described here to parse the HtmlDocument into an XDocument, but when querying the result I'm not sure how to handle namespaces. In one particular document the original HTML was actually poorly formatted XHTML with the following tag:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
When trying to query from this document it seems that the namespace attribute is preventing me from doing something like:
var x = xDoc.Descendants("div");
// returns an empty sequence (no matches), not null
Apparently for those "div" tags only the LocalName is "div", while the full element name (the XName) is the namespace plus "div". I have done some research on XML namespaces and it seems that I can bypass the namespace by querying this way:
// The range variable must not reuse the name of the local being declared.
var divs =
(from e in xDoc.Descendants()
where e.Name.LocalName == "div"
select e);
// works
However, this seems like a rather hacky solution that does not properly address the namespace issue. As I understand it, a proper XML document can contain multiple namespaces, so the proper way to handle this should be to work out which namespaces I'm querying under. Has anyone else had to do this? Am I just making it way too complicated? I know that I could avoid all this by sticking with HtmlDocument and querying with XPath, but I would rather stick to what I know (LINQ) if possible, and I would also like to know that I am not setting myself up for further namespace-related issues down the road.
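To illustrate what I mean by working out the namespace, here is a minimal sketch of the kind of query I think I'm after. It assumes every element I care about lives in the root element's namespace, which may not hold for documents that mix namespaces. The class name NamespaceQuerySketch, the helper InRootNamespace, and the sample markup are made up for illustration; they are not part of the HTML Agility Pack or of my real code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceQuerySketch
{
    // Hypothetical helper: finds descendants with the given local name
    // in whatever namespace the root element declares.
    public static XElement[] InRootNamespace(XDocument doc, string localName)
    {
        // For the XHTML document above this is "http://www.w3.org/1999/xhtml".
        XNamespace ns = doc.Root.Name.Namespace;

        // XNamespace + string yields a fully qualified XName.
        return doc.Descendants(ns + localName).ToArray();
    }

    public static void Main()
    {
        // Stand-in for the XDocument produced from the HtmlDocument.
        var xDoc = XDocument.Parse(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">" +
            "<body><div>one</div><div>two</div></body></html>");

        // Unqualified name: matches only elements in no namespace.
        Console.WriteLine(xDoc.Descendants("div").Count());      // prints 0

        // Qualified name: matches the XHTML div elements.
        Console.WriteLine(InRootNamespace(xDoc, "div").Length);  // prints 2
    }
}
```

If the document had no xmlns declaration, the root's namespace would be XNamespace.None and the same query would behave like the unqualified one, so in principle this should work for both the XHTML and plain HTML cases.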
What is the proper way to deal with namespaces in this situation?