views: 99

answers: 2

I am working with HTML documents, ripping out tables and parsing them if they turn out to be the correct tables. I am happy with the results: my extraction process successfully maps row labels and column headings in over 95% of cases, and in the cases where it does not, we can identify the problems and use other approaches.

In my scanning around the internet I have come to understand that a browser has a very powerful 'engine' to properly display the contents of HTML pages even if the underlying HTML is malformed. The problems we have with parsing tables have to do with things like not being able to separate the header from the data rows, or not being able to separate the row labels from one or more of the adjacent data values, and then not correctly parsing out adjacent data values. (We might have two data values mapped to one column heading instead of to the two adjacent column headings. That is, if I have a column heading labeled apple and then one labeled banana, I might have the value '1125 12345' assigned to the banana (or apple) column heading in the output instead of having the value 1125 assigned to apple and 12345 assigned to banana.)

As I said at the beginning, we get it right 95% of the time, and we can tell in the output when there is a problem. I am starting to think we have gone as far as we can using logic and inferences from the HTML to clean these up, so I am beginning to wonder if I need a new approach.

Is there a way to harness the 'engine' of a browser to help with this parser? Ultimately, if the browser can lay out the columns and rows so they display correctly on screen, then there must be some technology in it that handles cases where the row and column spans are not consistent, for example.
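
To illustrate what I mean, here is a rough sketch (using lxml; the markup and helper are purely illustrative, not our actual code) of the span expansion a browser's layout engine effectively performs before it paints the cells:

    from lxml import html

    def expand_table(table):
        """Expand rowspan/colspan into a rectangular grid of cell texts,
        roughly the way a browser lays a table out before painting it."""
        grid = {}  # (row, col) -> cell text
        for r, tr in enumerate(table.xpath('.//tr')):
            c = 0
            for cell in tr.xpath('./td | ./th'):
                # Skip positions already claimed by a spanning cell above/left.
                while (r, c) in grid:
                    c += 1
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))
                text = cell.text_content().strip()
                for dr in range(rowspan):
                    for dc in range(colspan):
                        grid[(r + dr, c + dc)] = text
                c += colspan
        n_rows = max(r for r, _ in grid) + 1
        n_cols = max(c for _, c in grid) + 1
        return [[grid.get((r, c), '') for c in range(n_cols)]
                for r in range(n_rows)]

    doc = html.fromstring(
        "<table><tr><th rowspan='2'>label</th><th>apple</th>"
        "<th>banana</th></tr><tr><td>1125</td><td>12345</td></tr></table>")
    print(expand_table(doc.xpath('//table')[0]))
    # [['label', 'apple', 'banana'], ['label', '1125', '12345']]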

Thanks for any observations

+2  A: 

Actually, browser engines are deliberately stupid in their parsing of HTML, assuming that what they get is only marginally correct. lxml and BeautifulSoup attempt to mimic this level of stupidity, so they are the correct tools to use.
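
For example (a minimal sketch; the broken markup is invented), BeautifulSoup with the lxml parser repairs tag soup much as a browser's parser would before you pull the rows out:

    from bs4 import BeautifulSoup

    # Invented malformed markup: none of the <tr>/<th>/<td> tags are closed.
    broken = """
    <table>
      <tr><th>apple<th>banana
      <tr><td>1125<td>12345
    </table>
    """

    # The lxml parser closes the implied tags just as a browser would.
    soup = BeautifulSoup(broken, "lxml")
    for tr in soup.find_all("tr"):
        print([c.get_text(strip=True) for c in tr.find_all(["th", "td"])])
    # ['apple', 'banana']
    # ['1125', '12345']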

Ignacio Vazquez-Abrams
+2  A: 

To "harness the 'engine' of a browser", your best bet at this time is no doubt SeleniumRC -- however its main advantage is in handling javascript "just like the browser would" (there are few other options for that); for a table that's simply logically broken though it may "look" OK when rendered, the browser (and therefore Selenium) may be just as helpless as lxml or BeautifulSoup. Still, may be worth your while to try.

Alex Martelli
@Alex thanks, so I have to keep developing logic for the edge cases.
PyNEwbie