views: 151 · answers: 3

I am trying to rip some text out of a large number of HTML documents (numbering in the hundreds of thousands). The documents are really forms, but they are prepared by a very large group of different organizations, so there is significant variation in how they create the documents. For example, the documents are divided into chapters. I might want to extract the contents of Chapter 5 from every document so I can analyze the content of that chapter. Initially I thought this would be easy, but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content, so that Chapter n could be displayed using td tags inside a table. Or they might use other elements such as p tags, h tags, div tags, or any other block-level element.

After trying repeatedly to use lxml to identify the beginning and end of each chapter, I have determined that it is a lot cleaner to use a regular expression, because in every case, no matter what the enclosing HTML element is, the chapter label is always of the form

>Chapter #

It is a little more complicated in that there might be some white space or a non-breaking space represented in different ways (&#160; or &nbsp; or just plain spaces). Nonetheless, it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.)
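A minimal sketch of such a boundary regex (the exact entity list is illustrative; real documents may use other encodings of the non-breaking space):

```python
import re

# Match ">Chapter <number>", allowing plain whitespace, the &#160; or
# &nbsp; entities, or a literal non-breaking-space character in between.
CHAPTER_RE = re.compile(
    r'>\s*Chapter(?:\s|&#160;|&nbsp;|\xa0)*(\d+)',
    re.IGNORECASE,
)

sample = '<font style="FONT-WEIGHT: bold">Chapter&#160;&#160;5. Operations</font>'
m = CHAPTER_RE.search(sample)
print(m.group(1))  # '5'
```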

But now I want to use lxml to get the text out. My thought is that I really have no choice but to walk along my string to find the close tag for the element that encloses the text I am using to find the relevant section.

That is, here is one example where the element holding the chapter name is a div:

<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>

So I am imagining that I would begin at the location where I found the match for Chapter 1 and set up a regular expression to find the next

</div|</td|</p|</h1 . . .

So at this point I have identified the type of element holding my chapter heading.
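A hedged sketch of that step: scan forward from the match position for the first closing tag of a block-level element (the tag list here is illustrative and would need to cover whatever the real documents use):

```python
import re

# First closing tag of a block-level element after the heading match
CLOSER_RE = re.compile(r'</(div|td|p|h[1-6])\b', re.IGNORECASE)

html = ('<div align="left"><font>Chapter 1.&#160;Our Beginnings.'
        '</font></div><p>Body text follows.</p>')
match_pos = html.find('Chapter 1')
m = CLOSER_RE.search(html, match_pos)
print(m.group(1))  # 'div' -- the element type holding the heading
```

Note that `</font>` is skipped because inline tags are deliberately left out of the pattern.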

I can use the same logic to find all of the text that is within that element; that is, set up a regular expression to help me mark the span from

>Chapter 1.&#160;&#160;&#160;Our Beginnings.<

So I have identified where my Chapter 1 begins.

I can do the same for Chapter 2 (which is where Chapter 1 ends).

Now I am imagining that I am going to snip the document, beginning at the opening of the element that indicates where Chapter 1 begins and ending just before the opening of the element that indicates where Chapter 2 begins. The string I have identified will then be fed to lxml, using its power to get the content.
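The snip-and-parse idea can be sketched like this on a toy document (the structure and the anchor pattern are illustrative; the real documents would need the element-type detection described above):

```python
import re
import lxml.html

# Toy document standing in for one of the real forms
html = (
    '<div><font>Chapter 1.&#160;Our Beginnings.</font></div>'
    '<p>Some body text for chapter one.</p>'
    '<div><font>Chapter 2.&#160;Later Years.</font></div>'
    '<p>Text for chapter two.</p>'
)

# Offsets where each chapter's enclosing element opens; here we anchor
# on the literal "<div><font>" for brevity
starts = [m.start() for m in re.finditer(r'<div><font>Chapter \d', html)]
starts.append(len(html))

chapters = []
for begin, end in zip(starts, starts[1:]):
    fragment = html[begin:end]
    # lxml's parser repairs any tags broken at the snip points,
    # and text_content() strips the markup away
    chapters.append(lxml.html.fromstring(fragment).text_content())

print(len(chapters))  # 2
```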

I am going to all of this trouble because I have read, over and over, never to use a regular expression to extract content from HTML documents, yet I have not hit on a way to be as accurate with lxml at identifying the starting and ending locations of the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings; it could be Our Red Canary. Let me say that I spent two solid days trying with lxml to be confident that I had the beginning and ending elements, and I could only be accurate less than 60% of the time, but a very short regular expression has given me better than 95% success.

I have a tendency to make things more complicated than necessary, so I am wondering if anyone has seen or solved similar problems, and whether they have an approach (not the details, mind you) that they would like to offer.

+1  A: 

The simplest thing it sounds like you could possibly do is iterate over tree.getroot().iterdescendants(), looking for a node whose node.text matches your desired regular expression. From that point, you can pass the node to a function that uses some ad-hoc heuristics to determine where the text is. (If iterdescendants on the root is too slow, you could use your regex approach and dive into etree to try to find an f(text_position) -> node function.)
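A minimal sketch of that walk, using lxml.html.fromstring() (which returns the root element directly, so iterdescendants() can be called on it without getroot()):

```python
import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s*\d')

html = ('<html><body>'
        '<div><font>Chapter 1. Our Beginnings.</font></div>'
        '<p>Body text.</p>'
        '</body></html>')
root = lxml.html.fromstring(html)

# Walk every node under the root, keeping those whose text matches
# the chapter pattern -- this finds the innermost enclosing element
hits = [el for el in root.iterdescendants()
        if el.text and CHAPTER_RE.search(el.text)]
print(hits[0].tag)  # 'font'
```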

For example, if you find that the target was a //tr/td, you can pass it to some table-text-finding subroutine that looks into the next td in node.getparent() to see if it has text that makes sense (approximately chapter-length, containing certain words, whatever). Likewise, you can make up heuristics for finding the data in other tags like div and p. If you find yourself in an unknown tag like font, you can try bubbling up a limited number of levels to find something you know how to handle -- you have to be cautious not to bubble up too far, or I imagine you might accidentally retrieve text from another chapter.
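The bubbling-up heuristic might look something like this (the tag set and the level limit are arbitrary choices for illustration):

```python
import lxml.html

# Tags we have ad-hoc extraction heuristics for (list is illustrative)
KNOWN_TAGS = {'td', 'div', 'p', 'h1', 'h2', 'h3'}

def bubble_up(node, max_levels=3):
    """Climb from an inner tag (e.g. <font>) toward an enclosing
    element we know how to handle; stop after max_levels so we do
    not accidentally climb into another chapter's container."""
    for _ in range(max_levels):
        if node.tag in KNOWN_TAGS:
            return node
        parent = node.getparent()
        if parent is None:
            break
        node = parent
    return None

root = lxml.html.fromstring(
    '<div><font><b>Chapter 1. Our Beginnings.</b></font></div>')
inner = root.find('.//b')
print(bubble_up(inner).tag)  # 'div'
```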

The crux of the problem seems to be that you're mining data that's not presented programmatically in a programmatic way -- in these cases, human interaction is usually necessary to some degree.

cdleary
+2  A: 

Sometimes there is not a straight path to getting the content when dealing with poorly or inconsistently written HTML.

You might want to look at using lynx or one of the text-based browsers to dump the page content, either into a file, or to pipe it into your code, and then process it. Or, you can use lxml to load and parse the page, then extract the text using text_content() and go after the chapters via regex.
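The flatten-then-split approach can be sketched in a few lines (the sample markup and the chapter regex are stand-ins for the real documents):

```python
import re
import lxml.html

html = ('<html><body>'
        '<div>Chapter 1. Alpha</div><p>First body.</p>'
        '<div>Chapter 2. Beta</div><p>Second body.</p>'
        '</body></html>')

# Flatten the page to plain text with lxml first, then carve it up
# with the chapter regex (a zero-width lookahead keeps the labels)
text = lxml.html.fromstring(html).text_content()
parts = [p for p in re.split(r'(?=Chapter\s*\d)', text) if p.strip()]
print(len(parts))  # 2
```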

Like they say, GIGO - garbage in, garbage out, and it's our job as developers to spin that garbage into gold. Doing so can get pretty messy.

Greg
+1  A: 

As I feared, there is no systematic way to use lxml to identify and extract what I need. Oh well, I appreciate everyone chiming in. Note: this is not the fault of lxml; it is the fault of the inconsistent HTML coding. For instance, because a chapter is a reasonable division of a document, all the content in one chapter should be wrapped in some type of element. Probably the most flexible would be a div tag, with the subsequent div being the next chapter. This would make a chapter a branch of the tree. Unfortunately, while approximately 20% of the documents might be that well structured, the others are not.

I could test for each type of element that should hold my content (div, p) and grab all of its children and all of its siblings until I get to the next element of that type that has information alerting me that we are at the end of the section (the beginning of the next section). But this seems like too much work when I am good 95% of the time or more with a regular expression.
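For what it's worth, the sibling-walking idea is not much code on a well-structured document; a sketch, assuming the heading and its section are siblings (which, as noted, holds for only a fraction of the real documents):

```python
import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s*\d')

html = ('<html><body>'
        '<div>Chapter 1. Alpha</div><p>First body.</p><p>More text.</p>'
        '<div>Chapter 2. Beta</div><p>Second body.</p>'
        '</body></html>')
root = lxml.html.fromstring(html)

# Find the first heading, then walk its following siblings until the
# next heading appears
heading = next(el for el in root.iterdescendants()
               if el.text and CHAPTER_RE.search(el.text))
section = [heading.text_content()]
for sibling in heading.itersiblings():
    if sibling.text and CHAPTER_RE.search(sibling.text):
        break  # reached the next chapter's heading
    section.append(sibling.text_content())
print(len(section))  # 3
```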

Thanks for all of the answers and comments; as always, I learned from them.

PyNEwbie