Hi everybody.

I have an HTML file (encoded in UTF-8) that I open with codecs.open(). The file structure is:

<html>
// header
<body>
  // some text
  <table>
    // some rows with cells here
    // some cells contain tables
  </table>
  // maybe some text here
  <table>
    // a form and other stuff
  </table>
  // probably some more text
</body></html>

I need to retrieve only the first table (discarding the one with the form) and omit everything before the first <table> and after the corresponding </table>. Some cells also contain paragraphs, bold text and scripts. There is no more than one nested table per row of the main table.

How can I extract it to get a list of rows, where each element holds the cell's data as a plain (unicode) string, plus a list of rows for each nested table? There is no more than one level of nesting.

I tried HTMLParse, PyParse and the re module, but couldn't get any of them working. I'm quite new to Python and, by the way, this is my first post on StackOverflow.

+4  A: 

Try Beautiful Soup.

In principle you need to use a real parser (which Beautiful Soup is). Regexes cannot deal with nested elements, for computer sciencey reasons (finite state machines can't parse context-free grammars, IIRC).
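To make the nesting problem concrete, here is a small sketch using only the standard library (Python 3's re and html.parser, not Beautiful Soup itself). The SAMPLE string and the TableDepth class are made up for illustration. The non-greedy regex stops at the first closing tag, which belongs to the inner table, while a parser can simply count how deep it is:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one outer table with a nested table in its first cell.
SAMPLE = (
    "<table><tr><td>a"
    "<table><tr><td>inner</td></tr></table>"
    "</td><td>b</td></tr></table>"
)

# The non-greedy match ends at the first </table>, i.e. the inner table's
# closing tag, so the second outer cell is cut off.
regex_match = re.search(r"<table>.*?</table>", SAMPLE, re.S).group(0)


class TableDepth(HTMLParser):
    """Records each piece of text together with its table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.cells = []  # (depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.cells.append((self.depth, text))


parser = TableDepth()
parser.feed(SAMPLE)
parser.close()

print(regex_match)
print(parser.cells)
```

A greedy pattern has the opposite problem: with two top-level tables in the page it would run through to the last closing tag and swallow the form table too.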

ʞɔıu
+1 for "computer sciencey reasons" :)
Paul Fisher
thanks for the answer. looks like it will solve my problem :)
paffnucy
I spent half the night trying to understand B.Soup, and I did. It works for me. Thanks again!
paffnucy
+2  A: 

If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.

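The XPath for pulling out the first table in the document would be "(//table)[1]". Note that a plain "//table[1]" matches every table that is the first table child of its parent, so it would also return nested tables.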
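As a rough illustration of how the positional predicate behaves with nested tables, here is a sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml (it supports only a subset of XPath and needs well-formed input). The DOC string and its id attributes are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document: a main table with one nested table,
# followed by a second top-level table holding a form.
DOC = """<html><body>
<table id="main"><tr><td><table id="nested"><tr><td>x</td></tr></table></td></tr></table>
<table id="form"><tr><td>form</td></tr></table>
</body></html>"""

root = ET.fromstring(DOC)

# [1] is evaluated per parent, so the nested table qualifies as well.
per_parent = [t.get("id") for t in root.findall(".//table[1]")]

# find() returns the first match in document order: the main table only.
first_in_doc = root.find(".//table").get("id")

print(per_parent)
print(first_in_doc)
```

With lxml the parenthesised form "(//table)[1]" should give the same single result as find() does here, since lxml evaluates full XPath 1.0 expressions.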

Nat
+2  A: 

You may like lxml. I'm not sure I really understood what you want to do with that structure, but maybe this example will help...

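import lxml.html

def process_row(row):
    # Yield one item per cell: its plain text, or a list of nested tables.
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if len(inner_tables) < 1:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    # Only direct child rows, so rows of nested tables are not mixed in.
    # list() turns the generator into a real list of cells.
    return [list(process_row(row)) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
# First table that is a direct child of <body>.
first_table = html.xpath('//body/table[1]')[0]

data = process_table(first_table)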
drdaeman
thanks. can't give you +1 (less than 15 'exp' :P).
paffnucy
(months later) just the recipe I need, if I survive http://lsimons.wordpress.com/2008/08/31/how-to-install-lxml-python-module-on-mac-os-105-leopard
Denis