ansaurus

Question

Is there a more Pythonic way to merge two HTML header rows with colspans?

Answer 1

+1 A:

Maybe look at the zip function for parts of the problem:

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>

enumerate can help avoid some of the complex indexing with counters:

>>> diff_list = []
>>> for place, header in enumerate(short_header):
    diff_list.append(abs(span_short[place] - span_long[place]))

>>> new_shortlist = []
>>> for place, num in enumerate(diff_list):
    if num:
        new_shortlist.extend(short_header[place] for item in range(num + 1))
    else:
        new_shortlist.append(short_header[place])


>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',... 
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...

Also, more Pythonic naming may add clarity:

    for each in range(len(short_header)):
        sum_span_long += span_long[long_header_count]
        sum_span_short += span_short[each]
        span_diff = sum_span_short - sum_span_long
        if not span_diff:
            combined_header.append...

bvmou 2008-11-10 07:25:44

What "conventions" did you change when reposting the original code?

S.Lott 2008-11-10 11:15:32

I used PEP 8 names and changed a=a+1 to a += 1, just to make it conform to the recommended style.

bvmou 2008-11-10 19:39:29

@bvmou: Here's the point -- it's kind of a long post for that small change -- a change so small I couldn't detect it and had to ask.

S.Lott 2008-11-10 21:06:12

Thanks, bvmou. This update makes quite a bit of sense.

PyNEwbie 2008-11-11 06:44:42

Answer 2

A:

Well, thanks. Unless I am missing something, zipping doesn't work unless we have two lists of equal length. The answer is not the same as mine, and mine is what I need. I hope that doesn't sound snippy. I am familiar with the zip function. The data will be spread out over the number of columns given by sum(any_Header). When len(long_Header) == len(shortHeader) and the colspans are equal, I zip the two together.

2008-11-10 07:38:31

Sorry, zip does not require two lists of equal length. Try it. Answer being different is relevant. Post that as a COMMENT, not a new answer. Feel free to delete this answer, since it isn't an actual answer.

S.Lott 2008-11-10 11:16:36

Or, if you can't comment, then edit your original question.

S.Lott 2008-11-10 14:26:48
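
On the point raised in the comments above about zip and lists of unequal length, a small sketch may help. The two shortened headers below are made up for illustration, and the code uses Python 3 syntax (in Python 2 the padding variant is named itertools.izip_longest). zip stops at the end of the shorter input rather than failing, so whether that is acceptable depends on whether the dropped cells matter.

```python
from itertools import zip_longest  # Python 2 spells this izip_longest

# Hypothetical, shortened headers used only for illustration.
long_header = ['', 'bananas', '', 'trains']
short_header = ['', 'bunches']

# zip does not raise on unequal lengths; it stops at the shorter input.
truncated = list(zip(long_header, short_header))

# zip_longest continues to the end of the longer input and pads with fillvalue.
padded = list(zip_longest(long_header, short_header, fillvalue=''))

print(truncated)
print(padded)
```

So zip runs without error on unequal lists, but it silently drops the tail of the longer one, which is why the colspans have to be accounted for before zipping header rows.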

Answer 3

+3 A:

Here is a modified version of your algorithm. zip is used to iterate over the short lengths and headers, and a class object is used to count and iterate over the long items, as well as to combine the headers. A while loop is more appropriate for the inner loop. (Forgive the too-short names.)

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [ 
       '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
    '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader

gimel 2008-11-10 09:05:10

+1: introduce objects

S.Lott 2008-11-10 11:14:50

Now to introduce generators...

gimel 2008-11-10 11:18:17
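
As a rough sketch of where the generator remark above could lead (this is not gimel's code, and the names expand and combine_headers are invented here), the counting done by the collector class can be replaced by a generator that yields each short-header cell once per column it spans. The data is copied from the answer above; the syntax is Python 3.

```python
from itertools import islice

def expand(cells, spans):
    """Yield each cell text once for every column it spans."""
    for text, span in zip(cells, spans):
        for _ in range(span):
            yield text

def combine_headers(long_cells, long_spans, short_cells, short_spans):
    """Yield one [combined text] entry per long-header cell."""
    short_columns = expand(short_cells, short_spans)
    for long_text, span in zip(long_cells, long_spans):
        # Consume the short-row columns lying under this long cell and
        # pair the long text with the first of them.
        covered = list(islice(short_columns, span))
        yield [long_text + ' ' + (covered[0] if covered else '')]

long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

result = list(combine_headers(long_header, span_long, short_header, span_short))
print(result)
```

For this data the sketch gives one entry per long-header cell, like the class-based version. It assumes both rows cover the same total number of columns.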

Answer 4

A:

Sorry, S.Lott, I can't comment because my reputation is zero. I have to wake up to understand your response.

2008-11-10 13:19:51

Answer 5

+2 A:

You've actually got a lot going on in this example.

You've "over-processed" the Beautiful Soup Tag objects to make lists. Leave them as Tags.
All of these kinds of merge algorithms are hard. It helps to treat the two things being merged symmetrically.

Here's a version that should work directly with the Beautiful Soup Tag objects. Also, this version doesn't assume anything about the lengths of the two rows.

def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
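        if i1 == len(row1):
            # row1 is exhausted: take the remaining cells from row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted: take the remaining cells from row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1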
        else:
            if row1[i1]['colspan'] < row2[i2]['colspan']:
                # Fill extra cols from row1
                c1= row1[i1]['colspan']
                while c1 != row2[i2]['colspan']:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif row1[i1]['colspan'] > row2[i2]['colspan']:
                # Fill extra cols from row2
                c2= row2[i2]['colspan']
                while row1[i1]['colspan'] != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert row1[i1]['colspan'] == row2[i2]['colspan']
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result

S.Lott 2008-11-10 13:20:02

Answer 6

A:

I guess I am going to answer my own question, but I did receive a lot of help. Thanks for all of it. I made S.LOTT's answer work after a few small corrections. (They may be so small as to not be visible (inside joke).) So now the question is: why is this more Pythonic? I think I see that it is less dense and that it works with the raw inputs instead of derivations. I cannot judge whether it is easier to read, though it is easy to read.

S.LOTT's Answer Corrected

row1=headerCells[0]
row2=headerCells[1]

i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
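    if i1 == len(row1):
        # row1 is exhausted: take the remaining cells from row2
        result.append( ' '.join(row2[i2]) )
        i2 += 1
    elif i2 == len(row2):
        # row2 is exhausted: take the remaining cells from row1
        result.append( ' '.join(row1[i1]) )
        i1 += 1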
    else:
        if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
            c1= int(row1[i1].get("colspan","1"))
            while c1 != int(row2[i2].get("colspan","1")): 
                txt1= ' '.join(row1[i1])  # needed to add when working adjust opposing case
                txt2= ' '.join(row2[i2])     # needed to add  when working adjust opposing case
                result.append( txt1 + " " + txt2 )  # needed to add when working adjust opposing case
                print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
                c1 += 1
                i1 += 1    # Is this the problem it

        elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
                # Fill extra cols from row2  Make same adjustment as above
            c2= int(row2[i2].get("colspan","1"))
            while int(row1[i1].get("colspan","1")) != c2:
                result.append( ' '.join(row1[i1]) )
                c2 += 1
                i2 += 1
        else:
            assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
            pass


        txt1= ' '.join(row1[i1])
        txt2= ' '.join(row2[i2])
        result.append( txt1 + " " + txt2 )
        print 'went to bottom', 'i1=',i1,'i2=',i2
        i1 += 1
        i2 += 1
print result

PyNEwbie 2008-11-11 06:41:52

1. Feel free to use function definitions to make this easier to read. 2. Accept an answer.

S.Lott 2008-11-11 12:17:27

I am not going to accept an answer yet, as there is not a particularly great one, though I have learned a lot from the provided answers. A great answer would work in the general case and would work the first time, out of the box. I still need rows>2. I want to try the other two answers.

PyNEwbie 2008-11-11 20:28:46

Answer 7

A:

Well I have an answer now. I was thinking through this and decided that I needed to use parts of every answer. I still need to figure out if I want a class or a function. But I have the algorithm that I think is probably more Pythonic than any of the others. But, it borrows heavily from the answers that some very generous people provided. I appreciate those a lot because I have learned quite a bit.

To save the time of having to make test cases, I am going to paste the complete code I have been banging away at in IDLE and follow that with an HTML sample file. Other than making a decision about class/function (and I need to think about how I am using this code in my program), I would be happy to see any improvements that make the code more Pythonic.

from BeautifulSoup import BeautifulSoup

original=file(r"C:\testheaders.htm").read()

soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')


header_Rows=[]
for each in range(len(all_Rows)):
    header_Rows.append(all_Rows[each])


header_Cells=[]
for each in header_Rows:
    header_Cells.append(each.findAll('td'))

temp_Header_Row=[]
header=[]
for row in range(len(header_Cells)):
    for column in range(len(header_Cells[row])):
        x=int(header_Cells[row][column].get("colspan","1"))
        if x==1:
            temp_Header_Row.append( ' '.join(header_Cells[row][column]) )
        else:
            # repeat the cell text once per spanned column
            for item in range(x):
                temp_Header_Row.append( ''.join(header_Cells[row][column]) )
    header.append(temp_Header_Row)
    temp_Header_Row=[]   # start a fresh list for the next row
combined_Header=zip(*header)

for each in combined_Header:
    print each

Okay, the test file contents are below. Sorry, I tried to attach these but couldn't make it happen:

  <TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
  <TR valign="bottom">
  <TD width="40%">&nbsp;</TD>
  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FOODS WE LIKE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD>

  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">OTHER THAN</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MONTY PYTHON</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">CHERRYPY</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">APPLE PIE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MOTHERS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FATHERS</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD nowrap align="left">Name</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">SHOWS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PROGRAMS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">BANANAS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PERFUME</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">TOOLS</TD>
  <TD>&nbsp;</TD>
  </TR>
  </TABLE>

PyNEwbie 2008-11-12 04:20:19
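
The approach in this last answer (repeat each cell once per column it spans, then zip the rows) can be sketched without BeautifulSoup, which may make it easier to test. The sketch below is an illustration and not the poster's code: the function names expand_row and merge_header_rows are invented, and the rows are hypothetical (text, colspan) pairs loosely modelled on the sample table. In real use the pairs would presumably come from each td tag, with the colspan read via int(tag.get("colspan", "1")) as in the code above.

```python
def expand_row(cells):
    """Repeat each cell's text once for every column it spans."""
    expanded = []
    for text, colspan in cells:
        expanded.extend([text] * colspan)
    return expanded

def merge_header_rows(rows):
    """Join the texts stacked in each column, skipping empty cells."""
    columns = zip(*(expand_row(row) for row in rows))
    return [' '.join(part for part in column if part) for column in columns]

# Hypothetical header rows as (text, colspan) pairs; each row covers 7 columns.
rows = [
    [('', 2), ('SILLY STUFF', 4), ('', 1)],
    [('', 2), ('MONTY PYTHON', 2), ('CHERRYPY', 2), ('', 1)],
    [('Name', 1), ('', 1), ('SHOWS', 2), ('PROGRAMS', 2), ('', 1)],
]

merged = merge_header_rows(rows)
for column in merged:
    print(repr(column))
```

Because zip accepts any number of rows, this handles more than two header rows. It relies on every row covering the same total number of columns; if one row is short, zip drops the trailing columns of the others.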
