ansaurus

Question

BeautifulSoup or regex HTML table to data structure?

Answer 1

+2 A:

There was a recent discussion in the Python group on LinkedIn about a similar issue, and lxml was apparently the most recommended Pythonic parser for HTML pages.

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
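For illustration, here is a minimal sketch of how lxml might turn a table into a list of rows. It assumes the lxml package is installed; the sample markup, the `grid` id, and the `table_to_rows` helper are made up for this example and are not from the linked discussion.

```python
import lxml.html

# Hypothetical sample markup: a simple two-by-two table with an id.
SAMPLE_HTML = """
<table id="grid">
<tr><td>1,1</td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""


def table_to_rows(html, table_id):
    """Return the table with the given id as a list of rows of cell text."""
    doc = lxml.html.document_fromstring(html)
    # Select the table by its id attribute; [0] raises IndexError if it is absent.
    table = doc.xpath('//table[@id=$tid]', tid=table_id)[0]
    return [
        [cell.text_content().strip() for cell in row.xpath('./td')]
        for row in table.xpath('.//tr')
    ]


rows = table_to_rows(SAMPLE_HTML, "grid")
print(rows)
```

The result is a plain list of lists, which is easy to convert into dicts or tuples if the first row holds headers.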

Meitham 2010-09-16 14:41:34

lxml has been a lot easier to use so far.

Wayne Werner 2010-09-16 15:45:03

Answer 2

A:

You'll probably need to identify the table by some attribute, such as its id or name.

from bs4 import BeautifulSoup

data = """
<table>
<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")

# Walk every table, then every row, and print each row's cells as a list.
for t in soup.find_all('table'):
    for tr in t.find_all('tr'):
        print([td.contents for td in tr.find_all('td')])
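
To pick out a single table by attribute, as suggested at the top of this answer, something like the following sketch could work. It assumes the bs4 package (BeautifulSoup 4); the page markup and the `prices` id are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the one with id="prices" is wanted.
page = """
<table id="nav"><tr><td>home</td></tr></table>
<table id="prices">
<tr><td>1,1</td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")
# find() returns the first matching tag, or None if nothing matches.
table = soup.find("table", attrs={"id": "prices"})
# Build a list of rows, each row a list of stripped cell strings.
grid = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(grid)
```

Matching on `name` or any other attribute works the same way through the `attrs` dict.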

Edit: What should the program do if there are multiple links in a cell?

Ex:

<td><a href="#">A</a> B <a href="#">C</a></td>
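
One possible way to handle such a cell, as a sketch only: keep both the cell's full text and a list of its links. This assumes the bs4 package (BeautifulSoup 4); the `parse_cell` helper and the dict layout are invented for this example, and the right structure depends on what the data is used for.

```python
from bs4 import BeautifulSoup

# The cell from the example above, wrapped in a minimal table.
html = '<table><tr><td><a href="#">A</a> B <a href="#">C</a></td></tr></table>'


def parse_cell(td):
    """Return a cell as a dict holding its full text and every link in it."""
    return {
        # Join all text fragments in the cell with single spaces.
        "text": td.get_text(" ", strip=True),
        # One (label, href) pair per <a> tag, in document order.
        "links": [(a.get_text(strip=True), a.get("href"))
                  for a in td.find_all("a")],
    }


soup = BeautifulSoup(html, "html.parser")
cells = [parse_cell(td) for td in soup.find_all("td")]
print(cells)
```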

razpeitia 2010-09-16 15:41:10
