Hi,

Can anyone tell me how I can get the table in an HTML page that has the most rows? I'm using BeautifulSoup.

There is one little problem though. Sometimes, there seems to be one table nested inside another.

<table>
    <tr>
        <td>
            <table>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
            </table>
        </td>
    </tr>
</table>

When table.findAll('tr') runs on the parent table, it counts both the parent's own rows and the rows of the nested table inside it. The parent table has just one row, but the nested table has three, so I would consider the nested one to be the largest table. Below is the code I'm currently using to dig out the largest table, but it doesn't take this scenario into account.

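from bs4 import BeautifulSoup  # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)  # html holds the page source as a string

# Get the largest table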
largest_table = None
max_rows = 0
for table in soup.findAll('table'):
    number_of_rows = len(table.findAll('tr'))
    if number_of_rows > max_rows:
        largest_table = table
        max_rows = number_of_rows

I'm really lost with this. Any help guys?

Thanks in advance

A: 

Calculate number_of_rows like this, so that a row is only counted when its nearest enclosing table is the table being examined:

number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table))
zifot
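
For anyone who wants to try this end to end, here is a minimal sketch that puts the loop from the question and the parent check from the answer together. It assumes BeautifulSoup 4 (the bs4 package) with the built-in 'html.parser'. The helper names count_own_rows and find_largest_table and the id attributes in the sample markup are made up for illustration and are not part of the original post. It compares with `is` rather than `==` because, as far as I know, BeautifulSoup 4 compares tags by their content, so two different tables with identical markup could compare equal.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed


def count_own_rows(table):
    """Count <tr> tags whose nearest enclosing <table> is this exact table."""
    return sum(
        1
        for tr in table.find_all('tr')
        if tr.find_parent('table') is table  # identity check, not content equality
    )


def find_largest_table(soup):
    """Return (table, row_count) for the table with the most rows of its own."""
    largest_table, max_rows = None, 0
    for table in soup.find_all('table'):
        number_of_rows = count_own_rows(table)
        if number_of_rows > max_rows:
            largest_table, max_rows = table, number_of_rows
    return largest_table, max_rows


# Sample markup modelled on the question; the ids are hypothetical and
# only there to make it easy to see which table was picked.
html = """
<table id="outer">
    <tr>
        <td>
            <table id="inner">
                <tr><td></td><td></td><td></td></tr>
                <tr><td></td><td></td><td></td></tr>
                <tr><td></td><td></td><td></td></tr>
            </table>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
largest_table, max_rows = find_largest_table(soup)
print(largest_table.get('id'), max_rows)  # prints: inner 3
```

With the original len(table.findAll('tr')) the outer table would report four rows (its own one plus the nested three) and win. With the parent check it reports one, so the nested table is picked.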