views:

130

answers:

3

I have the same code, written using win32com and xlrd. xlrd preforms the algorithm in less than a second, while win32com takes minutes.

Here is the win32com:

def makeDict(ws):
"""makes dict with key as header name, 
   value as tuple of column begin and column end (inclusive)"""
wsHeaders = {} # key is header name, value is column begin and end inclusive
for cnum in xrange(9, find_last_col(ws)):
    if ws.Cells(7, cnum).Value:
        wsHeaders[str(ws.Cells(7, cnum).Value)] = (cnum, find_last_col(ws))
        for cend in xrange(cnum + 1, find_last_col(ws)): #finds end column
            if ws.Cells(7, cend).Value:
                wsHeaders[str(ws.Cells(7, cnum).Value)] = (cnum, cend - 1)
                break
return wsHeaders

And the xlrd

def makeDict(ws):
"""makes dict with key as header name, 
   value as tuple of column begin and column end (inclusive)"""
wsHeaders = {} # key is header name, value is column begin and end inclusive
for cnum in xrange(8, ws.ncols):
    if ws.cell_value(6, cnum):
        wsHeaders[str(ws.cell_value(6, cnum))] = (cnum, ws.ncols)
        for cend in xrange(cnum + 1, ws.ncols):#finds end column
            if ws.cell_value(6, cend):
                wsHeaders[str(ws.cell_value(6, cnum))] = (cnum, cend - 1)
                break
return wsHeaders
A: 

COM requires talking to another process which actually handles the requests. xlrd works in-process on the data structures themselves.

Ignacio Vazquez-Abrams
so is it impossible to do this in a reasonable amount of time, with win32com?
Josh
+8  A: 

(0) You asked "Why is win32com so much slower than xlrd?" ... this question is a bit like "Have you stopped beating your wife?" --- it is based on a presupposition that may not be true; win32com was written in C by a brilliant programmer, but xlrd was written in pure Python by an average programmer. The real difference is that win32com has to call COM which involves inter-process communication and was written by you-know-who, whereas xlrd is reading the Excel file directly. Moreover, there's a fourth party in the scenario: YOU. Please read on.

(1) You don't show us the source of the find_last_col() function that you use repetitively in the COM code. In the xlrd code, you are happy to use the same value (ws.ncols) all the time. So in the COM code, you should call find_last_col(ws) ONCE and thereafter used the returned result. Update See answer to your separate question on how to get the equivalent of xlrd's Sheet.ncols from COM.

(2) Accessing each cell value TWICE is slowing down both codes. Instead of

if ws.cell_value(6, cnum):
    wsHeaders[str(ws.cell_value(6, cnum))] = (cnum, ws.ncols)

try

value = ws.cell_value(6, cnum)
if value:
    wsHeaders[str(value)] = (cnum, ws.ncols)

Note: there are 2 cases of this in each code snippet.

(3) It is not at all apparent what the purpose of your nested loops are, but there does seem to be some redundant computation, involving redundant fetches from COM. If you care to tell us what you are trying to achieve, with examples, we could be able to help you make it run much faster. At the very least, extracting the values from COM once then processing them in nested loops in Python should be faster. How many columns are there?

Update 2 Meanwhile the little elves took to your code with the proctoscope, and came up with the following script:

tests= [
    "A/B/C/D",
    "A//C//",
    "A//C//E",
    "A///D",
    "///D",
    ]
for test in tests:
    print "\nTest:", test
    row = test.split("/")
    ncols = len(row)
    # modelling the OP's code
    # (using xlrd-style 0-relative column indexes)
    d = {}
    for cnum in xrange(ncols):
        if row[cnum]:
            k = row[cnum]
            v = (cnum, ncols) #### BUG; should be ncols - 1 ("inclusive")
            print "outer", cnum, k, '=>', v
            d[k] = v
            for cend in xrange(cnum + 1, ncols):
                if row[cend]:
                    k = row[cnum]
                    v = (cnum, cend - 1)
                    print "inner", cnum, cend, k, '=>', v
                    d[k] = v
                    break
    print d
    # modelling a slightly better algorithm
    d = {}
    prev = None
    for cnum in xrange(ncols):
        key = row[cnum]
        if key:
            d[key] = [cnum, cnum]
            prev = key
        elif prev:
            d[prev][1] = cnum
    print d
    # if tuples are really needed (can't imagine why)
    for k in d:
        d[k] = tuple(d[k])
    print d

which outputs this:

Test: A/B/C/D
outer 0 A => (0, 4)
inner 0 1 A => (0, 0)
outer 1 B => (1, 4)
inner 1 2 B => (1, 1)
outer 2 C => (2, 4)
inner 2 3 C => (2, 2)
outer 3 D => (3, 4)
{'A': (0, 0), 'C': (2, 2), 'B': (1, 1), 'D': (3, 4)}
{'A': [0, 0], 'C': [2, 2], 'B': [1, 1], 'D': [3, 3]}
{'A': (0, 0), 'C': (2, 2), 'B': (1, 1), 'D': (3, 3)}

Test: A//C//
outer 0 A => (0, 5)
inner 0 2 A => (0, 1)
outer 2 C => (2, 5)
{'A': (0, 1), 'C': (2, 5)}
{'A': [0, 1], 'C': [2, 4]}
{'A': (0, 1), 'C': (2, 4)}

Test: A//C//E
outer 0 A => (0, 5)
inner 0 2 A => (0, 1)
outer 2 C => (2, 5)
inner 2 4 C => (2, 3)
outer 4 E => (4, 5)
{'A': (0, 1), 'C': (2, 3), 'E': (4, 5)}
{'A': [0, 1], 'C': [2, 3], 'E': [4, 4]}
{'A': (0, 1), 'C': (2, 3), 'E': (4, 4)}

Test: A///D
outer 0 A => (0, 4)
inner 0 3 A => (0, 2)
outer 3 D => (3, 4)
{'A': (0, 2), 'D': (3, 4)}
{'A': [0, 2], 'D': [3, 3]}
{'A': (0, 2), 'D': (3, 3)}

Test: ///D
outer 3 D => (3, 4)
{'D': (3, 4)}
{'D': [3, 3]}
{'D': (3, 3)}
John Machin
+1 I agree - the slowdown is not because of COM but because of the OP's useage of it that makes it seem more slow.
Cam
There are also ways to read and write multiple values to/from Excel using COM via safearrays, precisely because of the IPC overhead. You can do this transparently via win32com, by specifying ranges instead of individual cells.
Ryan Ginstrom
A: 

Thought about it as I was going to bed last night, and ended up using this. A far superior version to my original:

def makeDict(ws):
"""makes dict with key as header name, 
   value as tuple of column begin and column end (inclusive)"""
wsHeaders = {} # key is header name, value is column begin and end inclusive
last_col = find_last_col(ws)

for cnum in xrange(9, last_col):
    if ws.Cells(7, cnum).Value:
        value = ws.Cells(7, cnum).Value
        cstart = cnum
    if ws.Cells(7, cnum + 1).Value:
        wsHeaders[str(value)] = (cstart, cnum) #cnum is last in range
return wsHeaders
Josh
(1) You are still accessing the same cell twice (2) Previous code hardcoded 8th column and 6th row; this one uses 9 and 7 respectively; why?
John Machin