Hi, I've got an HTML table that I'm trying to parse information from. However, some of the cells span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists, so I would turn something like
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
into
[['1,1', '1,2'],
['2,1', '2,2']]
Which (I think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus, there's a lot of completely unnecessary information:
<td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
class="listdefaultmonthbg"
style="cursor:crosshair;"
width="5%"
nowrap="1"
rowspan="1">
<a class="listdatelink"
href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
</td>
And what the code really looks like is even worse. All I really need out of there is:
<td rowspan="1">Sep 5</td>
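As a minimal sketch of that reduction, assuming the newer bs4 package rather than the older BeautifulSoup 3 API: `get_text()` keeps only the text nodes, so the link tag disappears while its text stays, and `get()` reads the one attribute that matters. The long `ondblclick` and `href` values are shortened to `...` placeholders here purely for readability.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real cell; the "..." values are placeholders.
cell_html = """
<td ondblclick="DoAdd('...')" class="listdefaultmonthbg"
    style="cursor:crosshair;" width="5%" nowrap="1" rowspan="1">
<a class="listdatelink" href="...">Sep 5</a>
</td>
"""

cell = BeautifulSoup(cell_html, "html.parser").find("td")

# get_text() drops every tag (including <a>) and keeps only the text.
text = cell.get_text(strip=True)

# Attribute values come back as strings; default to 1 when rowspan is absent.
rowspan = int(cell.get("rowspan", 1))

print(text, rowspan)  # Sep 5 1
```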
Two rows later, there is a cell with a rowspan of 17. For multi-row spans I was thinking something like this:
<tr>
<td rowspan="2">Sep 5</td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
would end up like this:
[["Sep 5", "Some event"],
[None, "Some other event"]]
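One possible way to get that padded layout is sketched below, again assuming bs4. The helper name `table_to_grid` and the `pending` bookkeeping dict are made up for illustration. The idea is to remember, per column, how many more rows an earlier cell still covers, and to emit `None` for those positions. It assumes no nested tables and treats a non-numeric or `0` rowspan naively.

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    """Turn a <table> Tag into a list of lists, with None in spanned slots."""
    grid = []
    pending = {}  # column index -> rows still covered by an earlier rowspan
    for tr in table.find_all("tr"):
        row = []
        for cell in tr.find_all(["td", "th"]):
            # Skip columns that a cell from a previous row still occupies.
            while pending.get(len(row), 0) > 0:
                pending[len(row)] -= 1
                row.append(None)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for i in range(colspan):
                if rowspan > 1:
                    pending[len(row)] = rowspan - 1
                # Text goes in the first slot; extra colspan slots get None.
                row.append(cell.get_text(strip=True) if i == 0 else None)
        # Columns at the end of the row may also be covered by a rowspan.
        while pending.get(len(row), 0) > 0:
            pending[len(row)] -= 1
            row.append(None)
        grid.append(row)
    return grid

html = """
<table>
<tr><td rowspan="2">Sep 5</td><td>Some event</td></tr>
<tr><td>Some other event</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
grid = table_to_grid(table)
print(grid)  # [['Sep 5', 'Some event'], [None, 'Some other event']]
```

Using `None` for covered slots keeps every row the same length, so `grid[r][c]` always lines up with the visual column.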
There are multiple tables on the page, and I can already find the one I want; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup to "renderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).
I was thinking of a process something like this:
- Find table
- Count rows in table (len(table.findAll('tr'))?)
- Create list
- Parse table into list (BeautifulSoup syntax???)
- ???
- Profit! (Well, it's a purely internal program, so not really... )
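For the simple case without spans, the steps above could be sketched roughly as follows. This assumes bs4, where `find_all` is the newer spelling of `findAll`, and it simply takes the first table as a stand-in for whichever lookup already finds the right one.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>1,1</td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find table (placeholder: the first table on the page).
table = soup.find("table")

# Collect the rows; len(trs) gives the row count if it is needed.
trs = table.find_all("tr")

# Create the list and parse each row's cells into it.
rows = []
for tr in trs:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)  # [['1,1', '1,2'], ['2,1', '2,2']]
```

Counting the rows up front is not strictly required, since the outer list grows as rows are appended.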