I ran into a problem a week or so ago, and since I think the solution was cool I am sharing it here while I wait for an answer to the question I posted earlier. I need to know the relative position of the column headings in a table so that I can match each heading with the data in the rows below. Some of my tables had the following as their first row:

<!-- Table Width Row -->
<TR style="font-size: 1pt" valign="bottom">
<TD width="60%">&nbsp;</TD> <!-- colindex=01 type=maindata -->
<TD width="1%">&nbsp;</TD>  <!-- colindex=02 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=02 type=lead -->
<TD width="9%" align="right">&nbsp;</TD>    <!-- colindex=02 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=02 type=hang1 -->

<TD width="3%">&nbsp;</TD>  <!-- colindex=03 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=03 type=lead -->
<TD width="4%" align="right">&nbsp;</TD>    <!-- colindex=03 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=03 type=hang1 -->
<TD width="3%">&nbsp;</TD>  <!-- colindex=04 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=04 type=lead -->

<TD width="4%" align="right">&nbsp;</TD>    <!-- colindex=04 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=04 type=hang1 -->
<TD width="3%">&nbsp;</TD>  <!-- colindex=05 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=05 type=lead -->
<TD width="5%" align="right">&nbsp;</TD>    <!-- colindex=05 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=05 type=hang1 -->

 </TR>

I thought, wow, this will be easy, because the data sits in the column under each type=body cell. Counting the cells from zero, I knew that in the data rows I would need the values in columns [3, 7, 11, 15]. So I set out to get those positions with this code:

indexComment = souptoGetColIndex.findAll(text=re.compile("type=body"))
indexRow=indexComment[0].findParent()
indexCells=indexRow.findAll(text=re.compile("type=body"))
for each in range(len(indexCells)):
    collist.append(tdlist.index(indexCells[each].previousSibling.previousSibling))

What I got back was collist = [0, 3, 7, 7, 15].

I think this happened because the cells in the 7th and 11th positions look exactly alike, so index() returned the position of the first one both times. Clearly I had to make the cells look different. I did that by first using readlines to read in each line of the file and then replacing each non-breaking space with a random integer:

import random
newlt = []
for each in toGetColIndex:
    newlt.append(each.replace(r"&nbsp;", str(random.randint(1, 14567))))
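To see why the repeated 7 can show up in the first place, here is a minimal sketch in plain Python with no parser involved. The names `cells` and `types` are made up for illustration: the strings stand in for the parsed cells, and the list is cut off after the third body cell. The point is only that `list.index` returns the position of the first element that compares equal, so two identical cells always map to the same position.

```python
# Hypothetical stand-ins for the parsed <TD> cells (not the real soup objects).
cells = [
    '<TD width="60%">',               # 0  maindata
    '<TD width="1%">',                # 1  gutter
    '<TD width="1%" align="right">',  # 2  lead
    '<TD width="9%" align="right">',  # 3  body
    '<TD width="1%" align="left">',   # 4  hang1
    '<TD width="3%">',                # 5  gutter
    '<TD width="1%" align="right">',  # 6  lead
    '<TD width="4%" align="right">',  # 7  body
    '<TD width="1%" align="left">',   # 8  hang1
    '<TD width="3%">',                # 9  gutter
    '<TD width="1%" align="right">',  # 10 lead
    '<TD width="4%" align="right">',  # 11 body, identical to cell 7
]
types = ["maindata", "gutter", "lead", "body", "hang1",
         "gutter", "lead", "body", "hang1",
         "gutter", "lead", "body"]

# enumerate() hands back the true position of every body cell.
by_enumerate = [i for i, t in enumerate(types) if t == "body"]

# list.index() searches by equality and stops at the first match,
# so the cell at position 11 is reported as position 7.
by_index = [cells.index(cells[i]) for i in by_enumerate]

print(by_enumerate)
print(by_index)
```

Making every cell's content unique, as in the replace loop above, works around this because no two cells compare equal any more.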

A friend pointed out that I could lower the overhead by using this instead:

for each in toGetColIndex:
   newlt.append(each.replace(r"&nbsp;",str(toGetColIndex.index(each))))

Either way, each of these approaches gets me a list with the column indexes of my headers, which I can then use on the data rows. Note that if the string passed to replace looks like a blank space, the HTML rendering is swallowing it; the actual code uses r"&.n.b.s.p;" without the periods.
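One possible refinement of the second loop, offered as a sketch rather than a tested drop-in: calling index() inside the loop rescans the list on every pass and would return the same number for two lines that happen to be identical, whereas enumerate() supplies each line's position directly. The three sample lines and the name `unique_lines` below are made up for illustration.

```python
# Hypothetical sample of lines as readlines() might return them.
lines = [
    '<TD width="4%" align="right">&nbsp;</TD>    <!-- colindex=03 type=body -->',
    '<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=03 type=hang1 -->',
    '<TD width="4%" align="right">&nbsp;</TD>    <!-- colindex=04 type=body -->',
]

# Replace the non-breaking space with the line's own position, so cells
# that were identical before now differ in their content.
unique_lines = [line.replace("&nbsp;", str(i)) for i, line in enumerate(lines)]

for line in unique_lines:
    print(line)
```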

+1  A: 

The code below produces [3, 7, 11, 15], which is what I understand you are after:

from BeautifulSoup import BeautifulSoup
from re import compile

soup = BeautifulSoup(
    '''<HTML><BODY>
    <TABLE>
    <TR style="font-size: 1pt" valign="bottom">
    <TD width="60%"> </TD> <!-- colindex=01 type=maindata -->
    <TD width="1%"> </TD>  <!-- colindex=02 type=gutter -->
    <TD width="1%" align="right"> </TD>    <!-- colindex=02 type=lead -->
    <TD width="9%" align="right"> </TD>    <!-- colindex=02 type=body -->
    <TD width="1%" align="left"> </TD> <!-- colindex=02 type=hang1 -->

    <TD width="3%"> </TD>  <!-- colindex=03 type=gutter -->
    <TD width="1%" align="right"> </TD>    <!-- colindex=03 type=lead -->
    <TD width="4%" align="right"> </TD>    <!-- colindex=03 type=body -->
    <TD width="1%" align="left"> </TD> <!-- colindex=03 type=hang1 -->
    <TD width="3%"> </TD>  <!-- colindex=04 type=gutter -->
    <TD width="1%" align="right"> </TD>    <!-- colindex=04 type=lead -->

    <TD width="4%" align="right"> </TD>    <!-- colindex=04 type=body -->
    <TD width="1%" align="left"> </TD> <!-- colindex=04 type=hang1 -->
    <TD width="3%"> </TD>  <!-- colindex=05 type=gutter -->
    <TD width="1%" align="right"> </TD>    <!-- colindex=05 type=lead -->
    <TD width="5%" align="right"> </TD>    <!-- colindex=05 type=body -->
    <TD width="1%" align="left"> </TD> <!-- colindex=05 type=hang1 -->

     </TR>
    </TABLE> </BODY></HTML>'''
)

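tables = soup.findAll('table')
matcher = compile('colindex')

def body_cols(row):
    # Number the colindex comments in document order and yield the
    # positions of the ones marked type=body.
    for i, comment in enumerate(row.findAll(text=matcher)):
        if 'type=body' in comment:
            yield i

for table in tables:
    # The first row of each table is the width/index row.
    index_row = table.find('tr')
    print list(body_cols(index_row))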
Florian Bösch
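
For readers on Python 3, where the old `BeautifulSoup` import above is not available, the same idea of counting the colindex comments in order and keeping the positions marked type=body can be sketched with only the standard library's `html.parser`. This is not part of the original answer. The class name `BodyColumnFinder` and the shortened `sample` markup are made up for illustration, and the sketch only looks at the first row of the whole document rather than the first row of each table.

```python
from html.parser import HTMLParser


class BodyColumnFinder(HTMLParser):
    """Record positions of 'type=body' among the colindex comments of the first <tr>."""

    def __init__(self):
        super().__init__()
        self.in_first_row = False
        self.row_done = False
        self.position = 0      # running count of colindex comments seen
        self.body_cols = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TR> is reported as "tr".
        if tag == "tr" and not self.row_done:
            self.in_first_row = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_first_row:
            self.in_first_row = False
            self.row_done = True

    def handle_comment(self, data):
        if self.in_first_row and "colindex" in data:
            if "type=body" in data:
                self.body_cols.append(self.position)
            self.position += 1


# Shortened, hypothetical copy of the index row from the question.
sample = """<TABLE>
<TR style="font-size: 1pt" valign="bottom">
<TD width="60%">&nbsp;</TD> <!-- colindex=01 type=maindata -->
<TD width="1%">&nbsp;</TD>  <!-- colindex=02 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=02 type=lead -->
<TD width="9%" align="right">&nbsp;</TD>    <!-- colindex=02 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=02 type=hang1 -->
<TD width="3%">&nbsp;</TD>  <!-- colindex=03 type=gutter -->
<TD width="1%" align="right">&nbsp;</TD>    <!-- colindex=03 type=lead -->
<TD width="4%" align="right">&nbsp;</TD>    <!-- colindex=03 type=body -->
<TD width="1%" align="left">&nbsp;</TD> <!-- colindex=03 type=hang1 -->
</TR>
</TABLE>"""

finder = BodyColumnFinder()
finder.feed(sample)
finder.close()
print(finder.body_cols)
```

Because the positions come from a running counter rather than from a lookup by cell content, identical-looking cells do not collide.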