ansaurus

Question

Get HTML links within a specified <table> using minidom

Answer 1

A:

I think you want to first find the table element, then call getElementsByTagName on it. That should return all a elements that are descendants of the table element. Also, double-check that your HTML is XHTML; minidom is meant to parse XML, not HTML.
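A minimal sketch of that approach, assuming the page is well-formed XHTML and that the table you want is the first one in the document. The sample markup and variable names are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample markup; it has to be well-formed XML for minidom to accept it.
XHTML = """<html><body>
<a href="/outside">outside</a>
<table>
  <tr><td><a href="/one">one</a></td></tr>
  <tr><td><a href="/two">two</a></td></tr>
</table>
</body></html>"""

doc = minidom.parseString(XHTML)

# Assumption: the table of interest is the first <table> in the document.
table = doc.getElementsByTagName("table")[0]

# Called on an element, getElementsByTagName only searches that element's descendants,
# so the link outside the table is not picked up.
hrefs = [a.getAttribute("href") for a in table.getElementsByTagName("a")]
print(hrefs)
```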

Adam Crossland 2010-01-12 18:50:11

Answer 2

+1 A:

The problem is that minidom is a non-external-entity-reading XML parser. That means it doesn't even look at the DTD, so it doesn't know that in HTML the attribute with the name id corresponds to an ID schema type.

A further consequence of this is that minidom won't know about the HTML-specific entities like &eacute; that are defined in the XHTML doctype, so you may lose text that way.
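A small sketch of the entity problem. It covers only the simpler case of markup with no doctype at all, where the undefined entity is reported as a parse error; documents that carry an XHTML doctype may behave differently, as described above. The helper name try_parse is made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return the text of the root element, or None if minidom rejects the markup."""
    try:
        doc = minidom.parseString(markup)
    except ExpatError:
        return None
    return doc.documentElement.firstChild.data


# HTML-only named entity: no DTD has been read, so the parser does not know it.
named = try_parse("<p>caf&eacute;</p>")

# A numeric character reference needs no DTD and parses fine.
numeric = try_parse("<p>caf&#233;</p>")

print(named, numeric)
```

If you control the input, replacing named entities with numeric references before parsing is one way around this.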

If you don't care about this, you can continue using minidom and using an alternative way to get at the table, involving getElementsByTagName and checking element.id manually. (You could hack up your own getElementById function to do it the slow way.)
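A rough sketch of such a hand-rolled lookup, assuming well-formed XHTML input. The function name get_element_by_id, the sample markup, and the id value "links" are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample markup with two tables; only one carries the id we want.
XHTML = """<html><body>
<table id="nav"><tr><td><a href="/home">home</a></td></tr></table>
<table id="links">
  <tr><td><a href="/one">one</a></td></tr>
  <tr><td><a href="/two">two</a></td></tr>
</table>
</body></html>"""


def get_element_by_id(node, wanted_id):
    """Depth-first search for the first element whose id attribute matches."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.getAttribute("id") == wanted_id:
                return child
            found = get_element_by_id(child, wanted_id)
            if found is not None:
                return found
    return None


doc = minidom.parseString(XHTML)

# The built-in lookup finds nothing, because no DTD told minidom that id is an ID.
builtin = doc.getElementById("links")

# The slow manual walk does find the table.
table = get_element_by_id(doc, "links")
hrefs = [a.getAttribute("href") for a in table.getElementsByTagName("a")]
print(builtin, hrefs)
```

This walks the whole tree on every call, which is fine for a single page but worth caching if you look up many ids.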

Or you could use an XML parser that does allow external entities, such as pxdom. However, this means the parser will have to fetch and parse the DTD from the W3C site each time, which will be unpleasantly slow.

Or you could go for an HTML parser, which has the HTML entities and ID-nesses built in, such as BeautifulSoup. This might be a better idea when you are dealing with real-world HTML pages served as text/html, which, though they may claim to be XHTML, often include naughty bits that aren't well-formed.
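BeautifulSoup is a third-party package, so as a dependency-free illustration of the same idea, here is a sketch that swaps in the standard library's html.parser instead. It is not BeautifulSoup's API; the class name TableLinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags found inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0   # how many <table> tags deep we are inside the target table
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


# Hypothetical tag soup: unquoted attribute, unclosed <td>, <tr> and <br>.
# An XML parser such as minidom would reject this input.
HTML = """<html><body>
<a href="/outside">outside</a>
<table id=links>
  <tr><td><a href="/one">one</a><br>
  <tr><td><a href="/two">two</a>
</table>
<a href="/after">after</a>
</body></html>"""

collector = TableLinkCollector("links")
collector.feed(HTML)
collector.close()
print(collector.hrefs)
```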

bobince 2010-01-12 19:09:08

Ah, I didn't even think of getting the element 'id' of the table - that actually works pretty well. So what's the etiquette here? Select this as the answer now, or leave it open for a week to get some points?

Nicholas Palko 2010-01-12 19:36:21

[Points can be had after an answer is selected, so no worries there.]

bobince 2010-01-13 23:54:47
