ansaurus

Question

How to convert an HTML table to an array in python

Answer 1

+6 A:

Use BeautifulSoup (I recommend 3.0.8). Finding all tables is trivial:

import BeautifulSoup

def get_tables(htmldoc):
    soup = BeautifulSoup.BeautifulSoup(htmldoc)
    return soup.findAll('table')

However, in Python, an array (as provided by the standard `array` module) is 1-dimensional and restricted to elementary item types such as integers and floats. So there's no way to squeeze an HTML table into a Python array.
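To make that restriction concrete, here is a minimal sketch using the standard library's `array` module. The variable names are only illustrative; the point is that every item must match the array's typecode, so cell text cannot be stored.

```python
from array import array

# 'i' is the typecode for signed C ints; every item must be an integer.
numbers = array('i', [1, 2, 3])
numbers.append(4)

try:
    numbers.append('cell text')  # a string is not a valid 'i' item
    rejected = False
except TypeError:
    rejected = True

print(numbers.tolist())  # [1, 2, 3, 4]
print(rejected)          # True
```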

Maybe you mean a Python list instead? That's also 1-dimensional, but anything can be an item, so you could have a list of lists (one sublist per tr tag, I imagine, containing one item per td tag).

That would give:

def makelist(table):
  result = []
  allrows = table.findAll('tr')
  for row in allrows:
    result.append([])
    allcols = row.findAll('td')
    for col in allcols:
      thestrings = [unicode(s) for s in col.findAll(text=True)]
      thetext = ''.join(thestrings)
      result[-1].append(thetext)
  return result

This may not yet be quite what you want (doesn't skip HTML comments, the items of the sublists are unicode strings and not byte strings, etc) but it should be easy to adjust.
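As one possible adjustment, here is a rough sketch of the same list-of-lists idea using only the Python 3 standard library (`html.parser`) instead of BeautifulSoup. This is a swapped-in technique, not the code above: `TableRows` and `makelist_stdlib` are hypothetical names, and it assumes simple, non-nested tables. Because comments are delivered to a separate handler, they never end up in the cell text, and the items are ordinary Python 3 `str` objects.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each td cell, one sublist per tr (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of lists, one sublist per <tr>
        self.in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self.rows[-1].append('')  # start a new, empty cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Comments go to handle_comment, so they never reach this method.
        if self.in_cell:
            self.rows[-1][-1] += data


def makelist_stdlib(html):
    """Return the td text of every row found in the html string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = '<table><tr><td>a<!-- note --></td><td><b>b</b>c</td></tr></table>'
print(makelist_stdlib(sample))  # [['a', 'bc']]
```

Like `makelist`, this sketch only looks at `td` cells, and it puts the rows of every table in the document into one flat list of rows.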

Alex Martelli 2010-05-20 02:39:02

Beautiful soup is great and easy! Also try using lxml+xpath if looking for more speed.

Jweede 2010-05-20 02:41:57

wow, that worked perfectly. Thank you!

Zach 2010-05-20 03:58:27

@user, always glad to help. If it's so good an answer to your question, you should "accept" it (by clicking the checkmark-shaped icon below the number of votes on the answer's upper left) -- that's a key part of SO's etiquette!-)

Alex Martelli 2010-05-20 04:05:18

One more question: what if the table has a header row?

Zach 2010-05-20 04:08:17

That would have `th` items rather than `td`, so the corresponding sublist in `result` would be empty -- you could just add `if not result[-1]: del result[-1]` after the `for col` loop to remove such empty rows, for example.

Alex Martelli 2010-05-20 04:15:23

what if I'd like to include those header rows in the list?

Zach 2010-05-20 04:31:05

Then you'll need to look for `th` as well as `td`.

Alex Martelli 2010-05-20 04:35:23

Here's what I'm using to include the header rows: `allcols = row.findAll(re.compile('(td)|(th)'))`

Zach 2010-05-20 19:02:45

@user, yep, good idea, "finding" a RE rather than just a string is quite a good way to "look for th as well as td" as I recommended.

Alex Martelli 2010-05-20 22:45:16
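A small caveat on the pattern from the comments above, shown with plain `re` only. This assumes BeautifulSoup tests tag names with an unanchored regex search, which is my understanding but is not stated in the thread. Under that assumption, `'(td)|(th)'` would also accept any tag whose name merely contains `td` or `th`, such as `thead`. Inside a `tr` that rarely matters, but an anchored pattern is stricter.

```python
import re

unanchored = re.compile('(td)|(th)')  # the pattern from the comment above
anchored = re.compile(r'^t[dh]$')     # only the exact names td and th

names = ['td', 'th', 'thead', 'tr']
loose_matches = [n for n in names if unanchored.search(n)]
strict_matches = [n for n in names if anchored.search(n)]

print(loose_matches)   # ['td', 'th', 'thead']
print(strict_matches)  # ['td', 'th']
```

If I recall correctly, `findAll` also accepts a list of names, so `row.findAll(['td', 'th'])` may be a simpler way to say the same thing.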

Answer 2

A:

A +1 to the question-asker and another to the god of Python.
Wanted to try this example using lxml and CSS selectors.
Yes, this is mostly the same as Alex's example:

import lxml.html
from pprint import pprint
html = lxml.html.fromstring('''<html><body>\
<table width="600">
    <tr>
        <td width="50%">0,0,0</td>
        <td width="50%">0,0,1</td>
    </tr>
    <tr>
        <td>0,1,0</td>
        <td>0,1,1</td>
    </tr>
</table>
<table>
    <tr>
        <td>1,0,0</td>
        <td>1,<blink>0,</blink>1</td>
        <td>1,0,2</td>
        <td><bold>1</bold>,0,3</td>
    </tr>
</table>
</body></html>''')

tbl = []
rows = html.cssselect("tr")
for row in rows:
  tbl.append(list())
  for td in row.cssselect("td"):
    tbl[-1].append(unicode(td.text_content()))

pprint(tbl)
#[[u'0,0,0', u'0,0,1'],
# [u'0,1,0', u'0,1,1'],
# [u'1,0,0', u'1,0,1', u'1,0,2', u'1,0,3']]

Adam Bernier 2010-05-20 03:58:19

It is weird to use `list()` instead of the plain `[]`.

J.F. Sebastian 2010-05-20 05:03:46

@J.F. yeah, I suppose so. Thanks for the comment, and for all of your great answers :-) Keep up the good work.

Adam Bernier 2010-05-20 05:08:51
