Please do not write this code:
while condition is False:
Boolean conditions are boolean for cryin' out loud, so they can be tested (or negated and tested) directly:
while not condition:
Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.
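There is also a correctness angle, which I'll sketch here rather than belabor: "is False" is an identity test against the False singleton, so it only agrees with "not" when the value really is a bool. The two helper names below are made up purely for illustration.

```python
def stops_with_is_false(value):
    # Identity test: true only when value is the False singleton itself.
    return value is False

def stops_with_not(value):
    # Truthiness test: true for any falsy value (False, 0, '', None, []).
    return not value

# Print how the two tests treat a few falsy values.
for value in (False, 0, '', None, []):
    print(repr(value), stops_with_is_false(value), stops_with_not(value))
```

str.startswith always returns a real bool, so your loop happens to work, but the habit will bite as soon as a function returns 0 or None instead.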
Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test:
>>> import dis
>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 LOAD_GLOBAL 1 (False)
15 COMPARE_OP 8 (is)
18 RETURN_VALUE
Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could cut back on LOAD_GLOBAL by defining a local name for False:
>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 LOAD_FAST 1 (False)
15 COMPARE_OP 8 (is)
18 RETURN_VALUE
But what if we just drop the 'is' test completely?
>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 UNARY_NOT
13 RETURN_VALUE
We've collapsed a LOAD_xxx and a COMPARE_OP into a single UNARY_NOT. "is False" certainly isn't helping the performance cause any.
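If you want to see whether the shorter bytecode actually buys anything on your machine, a rough timing sketch follows. It assumes a current Python 3 interpreter (where the opcodes differ from the older listing above), and the sample line is made up. I'm not quoting any numbers, since they vary by version and hardware.

```python
import timeit

line = "address: 12 Main St\n"  # hypothetical sample line

with_is = lambda t: t.startswith('customernum') is False
with_not = lambda t: not t.startswith('customernum')

# Both forms must agree, because str.startswith always returns a real bool.
assert with_is(line) == with_not(line)
assert with_is('customernum: 42') == with_not('customernum: 42')

# Time each form on the same input; compare the two printed figures.
for name, fn in (('is False', with_is), ('not', with_not)):
    seconds = timeit.timeit(lambda: fn(line), number=200000)
    print('%-8s %.3f s' % (name, seconds))
```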
Now what if we could do some gross elimination of a line without making any function calls at all? If the first character of the line is not a 'c', there is no way the line starts with 'customernum'. Since our test is the negated one, the cheap check goes in front with an "or", so a line that does not start with 'c' is accepted without ever calling startswith. Let's try that:
>>> test = lambda t : t[0] != 'c' or not t.startswith('customernum')
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_CONST 0 (0)
6 BINARY_SUBSCR
7 LOAD_CONST 1 ('c')
10 COMPARE_OP 3 (!=)
13 JUMP_IF_TRUE 14 (to 30)
16 POP_TOP
17 LOAD_FAST 0 (t)
20 LOAD_ATTR 0 (startswith)
23 LOAD_CONST 2 ('customernum')
26 CALL_FUNCTION 1
29 UNARY_NOT
>> 30 RETURN_VALUE
(Note that using [0] to get the first character of a string does not create a slice - this is in fact very fast.)
Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using only fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'" we save ourselves an extraneous UNARY_NOT instruction.
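Before trusting a rough-cut filter like this, it is worth checking that it gives the same answers as the plain test. Here is a small sketch with made-up sample lines and helper names. Note that for the negated test the cheap check has to be joined with "or": a line that does not start with 'c' can be accepted on the spot.

```python
def plain(t):
    # The straightforward negated test.
    return not t.startswith('customernum')

def prefiltered(t):
    # A line that does not start with 'c' is accepted right away; only
    # lines starting with 'c' pay for the startswith() call.
    # Caveat: t[0] raises IndexError on an empty string.
    return t[0] != 'c' or not t.startswith('customernum')

# Hypothetical sample lines, including a 'c' line that is not a customernum.
samples = ['customernum: 1001\n', 'city: Austin\n', 'name: Ann\n', '\n']
for s in samples:
    assert plain(s) == prefiltered(s)
print('pre-filter agrees with the plain test on all samples')
```

The empty-string caveat is not a problem when iterating over a file, since every line yielded contains at least one character, but it matters if you call readline() and hit end of file.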
So, applying what we've learned about short-circuit optimization, I suggest changing this code:
while sline.startswith("customernum: ") is False:
    sline = txtdb.readline()
while sline.startswith("customernum: "):
    ... do the rest of the customer data stuff...
To this:
for sline in txtdb:
    if sline[0] == 'c' and \
       sline.startswith("customernum: "):
        ... do the rest of the customer data stuff...
Note that I have also removed the .readline() function call, and just iterate over the file using "for sline in txtdb".
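To make that concrete, here is a self-contained sketch of the loop. The io.StringIO object is only a stand-in for your open txtdb file, and the sample records and the "found" list are invented for the example. Your real per-customer processing would go where the append is.

```python
import io

# Hypothetical stand-in for the open text database file.
txtdb = io.StringIO(
    "header line\n"
    "customernum: 1001\n"
    "name: Ann\n"
    "customernum: 1002\n"
    "city: Austin\n"
)

found = []
for sline in txtdb:
    # Cheap first-character check, then the full prefix test.
    if sline[0] == 'c' and sline.startswith("customernum: "):
        # Keep just the customer number that follows the prefix.
        found.append(sline[len("customernum: "):].strip())

print(found)  # → ['1001', '1002']
```

The "city: Austin" line passes the first-character check but fails startswith, so it is the only non-matching line that pays for the function call.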
I realize Alex has provided a different body of code entirely for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm before pulling out the big but obscure block-reading guns.