I am working with HTML documents, ripping out tables and parsing them if they turn out to be the correct tables. I am happy with the results: my extraction process successfully maps row labels and column headings in over 95% of cases, and in the cases where it does not, we can identify the problems and use other approaches.
In my scanning around the internet I have come to understand that a browser has a very powerful 'engine' that properly displays the contents of HTML pages even when the underlying HTML is malformed. The problems we have with parsing tables involve things like not being able to separate the header from the data rows, not being able to separate the row labels from one or more adjacent data values, and not correctly parsing out adjacent data values. (We might have two data values mapped to one column heading instead of to the two adjacent column headings. That is, if I have a column heading labeled apple and another labeled banana, I might get the value '1125 12345' assigned to the banana (or apple) column in the output, instead of 1125 assigned to apple and 12345 assigned to banana.)
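For concreteness, here is a small sketch of the kind of failure mode I mean. The markup and the extractor are both hypothetical (our real pipeline is different); the snippet just shows how an omitted closing tag, which a browser recovers from silently, can fuse two adjacent values for a naive extractor:

```python
import re

# Hypothetical malformed row: the author omitted </td> on the first data
# cell. A browser's error recovery implicitly closes it at the next <td>,
# but a deliberately naive extractor does not.
row = '<tr><td>1125<td>12345</td></tr>'

# Naive approach: grab everything between <td> and </td>.
naive = re.findall(r'<td>(.*?)</td>', row)
print(naive)  # → ['1125<td>12345'] -- two values fused into one "cell"
```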
As I said at the beginning, we get it right 95% of the time, and we can tell in the output when there is a problem. I am starting to think we have gone as far as we can using logic and inferences from the raw HTML to clean these up, so I am beginning to wonder if I need a new approach.
Is there a way to harness the 'engine' of a browser to help with this parser? Ultimately, if the browser can properly display the columns and rows on the screen, then there must be some technology that handles the layout even when the row and column spans are not consistent (for example).
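To illustrate the kind of span handling I am hoping to reproduce, here is a rough sketch (Python, standard library only, not our production code) of expanding rowspan/colspan into a rectangular grid, which I assume is roughly what a browser's layout engine does internally before positioning cells:

```python
from html.parser import HTMLParser

class TableGrid(HTMLParser):
    """Expand rowspan/colspan into a rectangular grid of cell text."""
    def __init__(self):
        super().__init__()
        self.grid = {}       # (row, col) -> cell text
        self.row = -1
        self.col = 0
        self.in_cell = False
        self.spans = (1, 1)
        self.text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            # Skip columns already claimed by a rowspan from a row above.
            while (self.row, self.col) in self.grid:
                self.col += 1
            self.in_cell = True
            self.text = []
            self.spans = (int(a.get("rowspan", 1)), int(a.get("colspan", 1)))

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            value = " ".join("".join(self.text).split())
            rs, cs = self.spans
            # Copy the value into every grid slot the spans cover.
            for r in range(self.row, self.row + rs):
                for c in range(self.col, self.col + cs):
                    self.grid[(r, c)] = value
            self.col += cs
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)

def to_rows(html):
    p = TableGrid()
    p.feed(html)
    if not p.grid:
        return []
    nrows = max(r for r, _ in p.grid) + 1
    ncols = max(c for _, c in p.grid) + 1
    return [[p.grid.get((r, c), "") for c in range(ncols)] for r in range(nrows)]

html = """<table>
  <tr><th rowspan="2">label</th><th>apple</th><th>banana</th></tr>
  <tr><td>1125</td><td>12345</td></tr>
</table>"""
rows = to_rows(html)
print(rows)  # → [['label', 'apple', 'banana'], ['label', '1125', '12345']]
```

With the spans expanded, 1125 lands under apple and 12345 under banana even though the second row has fewer physical cells. This sketch still assumes reasonably well-formed tags, though, which is exactly why I am asking whether a real browser engine's recovery logic can be borrowed instead.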
Thanks for any observations.