views: 151 · answers: 3

I am trying to rip some text out of a large number of HTML documents (numbering in the hundreds of thousands). The documents are really forms, but they are prepared by a very large group of different organizations, so there is significant variation in how they create the documents. For example, the documents are divided into chapters. I might want to extract the contents of Chapter 5 from every document so I can analyze the content of that chapter. Initially I thought this would be easy, but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content, so that Chapter n could be displayed using td tags inside a table. Or they might use other elements such as p tags, h tags, div tags, or any other block-level element.

After trying repeatedly to use lxml to identify the beginning and end of each chapter, I have determined that it is a lot cleaner to use a regular expression, because in every case, no matter what the enclosing HTML element is, the chapter label is always of the form

>Chapter #

It is a little more complicated in that there might be some white space or a non-breaking space represented in different ways (&#160; or &nbsp; or just plain spaces). Nonetheless, it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.)
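A minimal sketch of such a boundary regex (the exact entity list is illustrative; real documents may use other encodings of the non-breaking space):

```python
import re

# Match ">Chapter <number>", allowing plain whitespace, the &#160; or
# &nbsp; entities, or a literal non-breaking-space character in between.
CHAPTER_RE = re.compile(
    r'>\s*Chapter(?:\s|&#160;|&nbsp;|\xa0)*(\d+)',
    re.IGNORECASE,
)

sample = '<font style="FONT-WEIGHT: bold">Chapter&#160;&#160;5. Operations</font>'
m = CHAPTER_RE.search(sample)
print(m.group(1))  # '5'
```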

But now I want to use lxml to get the text out. My thought is that I really have no choice but to walk along my string to find the close tag for the element that encloses the text I am using to find the relevant section.

That is, here is one example where the element holding the chapter name is a div:

<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>

So I am imagining that I would begin at the location where I found the match for Chapter 1 and set up a regular expression to find the next

</div|</td|</p|</h1 . . .

So at this point I have identified the type of element holding my chapter heading.
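A hedged sketch of that step: scan forward from the match position for the first closing tag of a block-level element (the tag list here is illustrative and would need to cover whatever the real documents use):

```python
import re

# First closing tag of a block-level element after the heading match
CLOSER_RE = re.compile(r'</(div|td|p|h[1-6])\b', re.IGNORECASE)

html = ('<div align="left"><font>Chapter 1.&#160;Our Beginnings.'
        '</font></div><p>Body text follows.</p>')
match_pos = html.find('Chapter 1')
m = CLOSER_RE.search(html, match_pos)
print(m.group(1))  # 'div' -- the element type holding the heading
```

Note that `</font>` is skipped because inline tags are deliberately left out of the pattern.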

I can use the same logic to find all of the text that is within that element; that is, set up a regular expression to help me mark the span from

>Chapter 1.&#160;&#160;&#160;Our Beginnings.<

So I have identified where my Chapter 1 begins.

I can do the same for Chapter 2 (which is where Chapter 1 ends).

Now I am imagining that I am going to snip the document, beginning at the opening of the element that indicates where Chapter 1 begins and ending just before the opening of the element that indicates where Chapter 2 begins. The string I have identified will then be fed to lxml, using its power to get the content.
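The snip-and-parse idea can be sketched like this on a toy document (the structure and the anchor pattern are illustrative; the real documents would need the element-type detection described above):

```python
import re
import lxml.html

# Toy document standing in for one of the real forms
html = (
    '<div><font>Chapter 1.&#160;Our Beginnings.</font></div>'
    '<p>Some body text for chapter one.</p>'
    '<div><font>Chapter 2.&#160;Later Years.</font></div>'
    '<p>Text for chapter two.</p>'
)

# Offsets where each chapter's enclosing element opens; here we anchor
# on the literal "<div><font>" for brevity
starts = [m.start() for m in re.finditer(r'<div><font>Chapter \d', html)]
starts.append(len(html))

chapters = []
for begin, end in zip(starts, starts[1:]):
    fragment = html[begin:end]
    # lxml's parser repairs any tags broken at the snip points,
    # and text_content() strips the markup away
    chapters.append(lxml.html.fromstring(fragment).text_content())

print(len(chapters))  # 2
```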

I am going to all of this trouble because I have read, over and over, never to use a regular expression to extract content from HTML documents, yet I have not hit on a way to be as accurate with lxml at identifying the starting and ending locations of the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings; it could be Our Red Canary. Let me say that I spent two solid days trying with lxml to be confident that I had the beginning and ending elements, and I could only be accurate less than 60% of the time, but a very short regular expression has given me better than 95% success.

I have a tendency to make things more complicated than necessary, so I am wondering if anyone has seen or solved similar problems, and whether they have an approach (not the details, mind you) that they would like to offer.

+1  A: 

The simplest thing it sounds like you could possibly do is iterate over tree.getroot().iterdescendants(), looking for a node whose node.text matches your desired regular expression. From that point, you can pass the node to a function that uses some ad-hoc heuristics to determine where the text is. (If iterdescendants on the root is too slow, you could use your regex approach and dive into etree to try to find an f(text_position) -> node function.)
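A minimal sketch of that walk, using lxml.html.fromstring() (which returns the root element directly, so iterdescendants() can be called on it without getroot()):

```python
import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s*\d')

html = ('<html><body>'
        '<div><font>Chapter 1. Our Beginnings.</font></div>'
        '<p>Body text.</p>'
        '</body></html>')
root = lxml.html.fromstring(html)

# Walk every node under the root, keeping those whose text matches
# the chapter pattern -- this finds the innermost enclosing element
hits = [el for el in root.iterdescendants()
        if el.text and CHAPTER_RE.search(el.text)]
print(hits[0].tag)  # 'font'
```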

For example, if you find that the target was a //tr/td, you can pass it to some table-text-finding subroutine that looks into the next td in node.getparent() to see if it has text that makes sense (approximately chapter-length, containing certain words, whatever). Likewise, you can make up heuristics for finding the data in other tags like div and p. If you find yourself in an unknown tag like font, you can try bubbling up a limited number of levels to find something you know how to handle -- you have to be cautious not to bubble up too far, or I imagine you might accidentally retrieve text from another chapter.
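The bubbling-up heuristic might look something like this (the tag set and the level limit are arbitrary choices for illustration):

```python
import lxml.html

# Tags we have ad-hoc extraction heuristics for (list is illustrative)
KNOWN_TAGS = {'td', 'div', 'p', 'h1', 'h2', 'h3'}

def bubble_up(node, max_levels=3):
    """Climb from an inner tag (e.g. <font>) toward an enclosing
    element we know how to handle; stop after max_levels so we do
    not accidentally climb into another chapter's container."""
    for _ in range(max_levels):
        if node.tag in KNOWN_TAGS:
            return node
        parent = node.getparent()
        if parent is None:
            break
        node = parent
    return None

root = lxml.html.fromstring(
    '<div><font><b>Chapter 1. Our Beginnings.</b></font></div>')
inner = root.find('.//b')
print(bubble_up(inner).tag)  # 'div'
```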

The crux of the problem seems to be that you're mining data that's not presented programmatically in a programmatic way -- in these cases, human interaction is usually necessary to some degree.

cdleary
+2  A: 

Sometimes there is not a straight path to getting the content when dealing with poorly or inconsistently written HTML.

You might want to look at using lynx or one of the text-based browsers to dump the page content, either into a file, or to pipe it into your code, and then process it. Or, you can use lxml to load and parse the page, then extract the text using text_content() and go after the chapters via regex.
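The flatten-then-split approach can be sketched in a few lines (the sample markup and the chapter regex are stand-ins for the real documents):

```python
import re
import lxml.html

html = ('<html><body>'
        '<div>Chapter 1. Alpha</div><p>First body.</p>'
        '<div>Chapter 2. Beta</div><p>Second body.</p>'
        '</body></html>')

# Flatten the page to plain text with lxml first, then carve it up
# with the chapter regex (a zero-width lookahead keeps the labels)
text = lxml.html.fromstring(html).text_content()
parts = [p for p in re.split(r'(?=Chapter\s*\d)', text) if p.strip()]
print(len(parts))  # 2
```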

Like they say, GIGO - garbage in, garbage out, and it's our job as developers to spin that garbage into gold. Doing so can get pretty messy.

Greg
+1  A: 

As I feared, there is no systematic way to use lxml to identify and extract what I need. Oh well, I appreciate everyone chiming in. Note: this is not the fault of lxml; it is the fault of the inconsistent HTML coding. For instance, because a chapter is a reasonable division of a document, all the content in one chapter should be wrapped in some type of element. Probably the most flexible would be a div tag, with the subsequent div being the next chapter. This would make a chapter a branch of the tree. Unfortunately, while approximately 20% of the documents might be that well structured, the others are not.

I could test for each type of element that should hold my content (div, p) and grab all of its children and all of its siblings until I get to the next element of that type that has information alerting me that we are at the end of the section (the beginning of the next section). But this seems like too much work when I am good 95% of the time or more with a regular expression.
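For what it's worth, the sibling-walking idea is not much code on a well-structured document; a sketch, assuming the heading and its section are siblings (which, as noted, holds for only a fraction of the real documents):

```python
import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s*\d')

html = ('<html><body>'
        '<div>Chapter 1. Alpha</div><p>First body.</p><p>More text.</p>'
        '<div>Chapter 2. Beta</div><p>Second body.</p>'
        '</body></html>')
root = lxml.html.fromstring(html)

# Find the first heading, then walk its following siblings until the
# next heading appears
heading = next(el for el in root.iterdescendants()
               if el.text and CHAPTER_RE.search(el.text))
section = [heading.text_content()]
for sibling in heading.itersiblings():
    if sibling.text and CHAPTER_RE.search(sibling.text):
        break  # reached the next chapter's heading
    section.append(sibling.text_content())
print(len(section))  # 3
```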

Thanks for all of the answers and comments; as always, I learned from them.

PyNEwbie