I am working with a large set of HTML documents. One of my tasks is to extract all of the text from them. I have gotten pretty far, but I am stumped by tables that are used as containers or formatting structures for information that is not numeric in nature.

My goal is to skip a table (leave it behind, not extract it) if it is a table of numeric fields, while still extracting the text from tables that are only used for layout.

I am getting ready to implement a brute-force, rule-based approach: take a table, and if more than some percentage of its td.text_content() values can be classified as digits, decide that the table is a table of numeric values.
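A minimal sketch of that rule, assuming lxml.html (which is where td.text_content() comes from). The 0.6 threshold, the regular expressions, and all of the function names are placeholders chosen for illustration, not something tuned on real documents:

```python
import re

# Assumption: a cell counts as "numeric" if, after removing whitespace,
# thousands separators, currency/percent signs and parentheses, what is
# left is a plain integer or decimal number.
_STRIP = re.compile(r"[\s,$%()]")
_NUMERIC = re.compile(r"^[-+]?(\d+(\.\d*)?|\.\d+)$")


def looks_numeric(text):
    """Return True if a cell's text looks like a number."""
    return bool(_NUMERIC.match(_STRIP.sub("", text)))


def numeric_fraction(cells):
    """Fraction of non-empty cell strings that look numeric."""
    filled = [c for c in cells if c.strip()]
    if not filled:
        return 0.0
    return sum(looks_numeric(c) for c in filled) / len(filled)


def is_numeric_table(table, threshold=0.6):
    """Classify an lxml <table> element; threshold is a guess to tune."""
    cells = [td.text_content() for td in table.iter("td")]
    return numeric_fraction(cells) >= threshold


def text_without_numeric_tables(html, threshold=0.6):
    """Drop tables judged numeric, then return the remaining text."""
    import lxml.html  # third-party; imported here so the helpers above stay stdlib-only

    root = lxml.html.document_fromstring(html)
    # Reverse document order so nested tables are judged before the
    # layout tables that contain them.
    for table in reversed(list(root.iter("table"))):
        if is_numeric_table(table, threshold):
            table.drop_tree()
    return root.text_content()
```

Empty cells are ignored so that spacer cells in layout tables do not dilute the ratio, and header cells (th) are left out of the count on purpose, since a numeric table usually has text headers.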

Can anyone suggest a better approach?