ansaurus

Question

Answer 1

+4 A:

In principle you need to use a real parser (which Beautiful Soup is). Regex cannot deal with nested elements, for computer sciencey reasons (regular expressions are equivalent to finite state machines, which can't parse context-free grammars such as arbitrarily nested tags, IIRC).
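
To make the nesting problem concrete, here is a minimal sketch. It uses only Python's standard library (re and html.parser) rather than Beautiful Soup, and the sample markup and the TableDepth class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: an outer table with one table nested in a cell.
HTML = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)

# A non-greedy regex stops at the first </table>, which closes the
# INNER table, so the match for the outer table is cut short.
regex_match = re.search(r"<table>.*?</table>", HTML, re.S).group(0)


class TableDepth(HTMLParser):
    """Records the nesting depth at which each <table> opens."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current table nesting level
        self.depths = []  # depth seen at each <table> start tag

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.depths.append(self.depth)

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1


parser = TableDepth()
parser.feed(HTML)
parser.close()

print(regex_match)    # truncated: two <table> opens, one </table>
print(parser.depths)  # depth of each table as the parser saw it
```

The regex match ends at the inner table's closing tag, while the parser keeps a counter and so knows which closing tag belongs to which table. Beautiful Soup and lxml do this bookkeeping for you and hand back a tree.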

ʞɔıu 2009-06-03 14:07:04

+1 for "computer sciencey reasons" :)

Paul Fisher 2009-06-03 14:15:31

thanks for the answer. looks like it will solve my problem :)

paffnucy 2009-06-03 14:58:07

I spent half the night trying to understand Beautiful Soup, and I did: it works for me. Thanks again!

paffnucy 2009-06-04 10:37:24

Answer 2

+2 A:

If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.

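The XPath for pulling out the first table in the document would be "(//table)[1]". Note that "//table[1]" without the parentheses matches every table that is the first table child of its parent, which includes nested tables.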
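
A small sketch of how a positional predicate behaves when tables are nested. It assumes well-formed (XHTML-style) markup and swaps in the standard library's xml.etree.ElementTree, which supports a limited XPath subset, in place of lxml. The sample document and its ids are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document: "inner" is nested inside "outer",
# and "second" is a sibling of "outer".
DOC = """<html><body>
<table id="outer"><tr><td>
<table id="inner"><tr><td>x</td></tr></table>
</td></tr></table>
<table id="second"><tr><td>y</td></tr></table>
</body></html>"""

root = ET.fromstring(DOC)

# [1] is evaluated per parent: every table that is the first <table>
# child of its own parent matches, nested ones included.
first_per_parent = [t.get("id") for t in root.findall(".//table[1]")]

# find() returns only the first match in document order.
first_in_document = root.find(".//table").get("id")

print(first_per_parent)
print(first_in_document)
```

With this input the positional predicate selects both "outer" and "inner" but not "second", while find() returns just "outer". With lxml's full XPath support, wrapping the path in parentheses before applying [1] gives the single first table.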

Nat 2009-06-03 14:13:23

Answer 3

+2 A:

You may like lxml. I'm not sure I really understood what you want to do with that structure, but maybe this example will help...

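import lxml.html

def process_row(row):
    # Yield one item per cell: its text, or a list of parsed nested tables.
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if not inner_tables:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    # process_row is a generator, so list() materializes each row.
    # Note: './tr' only finds rows that are direct children of <table>;
    # rows wrapped in an explicit <tbody> would need './tbody/tr'.
    return [list(process_row(row)) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
first_table = html.xpath('//body/table[1]')[0]

data = process_table(first_table)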

drdaeman 2009-06-03 14:29:27

thanks. can't give you +1 (less than 15 'exp' :P).

paffnucy 2009-06-03 15:04:17

(months later) just the recipe I need, if I survive http://lsimons.wordpress.com/2008/08/31/how-to-install-lxml-python-module-on-mac-os-105-leopard

Denis 2009-12-25 16:45:05


How to extract nested tables from HTML?
