Hi all,

I'm importing several parts of a database dump in text format into MySQL. The problem is that before the interesting data there is a lot of uninteresting stuff in front of it. I wrote this loop to get to the needed data:

def readloop(DBFILE):
    txtdb = open(DBFILE, 'r')

    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum:  ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum:  "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum:  ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break
        customernum = getitem(data, "customernum:  ")
        street = getitem(data, "street:  ")
        country = getitem(data, "country:  ")
        zip = getitem(data, "zip:  ")

The text file is pretty huge, so just looping until the first wanted entry takes a lot of time. Does anyone have an idea whether this could be done faster (or whether my whole approach is a bad idea)?

Many thanks in advance!

A: 

This might help: Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext

Dustin Getz
readlines() sounds interesting, I'll check that solution first. Many thanks :-)
Birt
+1  A: 

I guess you are writing this import script and it gets boring to wait while testing it, since the data stays the same every time.

You can run the script once to detect the actual positions in the file you want to jump to, with print txtdb.tell(). Write those down and replace the searching code with txtdb.seek(pos). Basically, that's building an index for the file ;-)
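A minimal sketch of that idea, in the same Python 2 style as the question (the recorded offset is made up; use whatever your first run prints):

# Run once: find and print the offset of the first interesting line.
txtdb = open(DBFILE, 'r')
while True:
    pos = txtdb.tell()            # byte offset of the line about to be read
    sline = txtdb.readline()
    if not sline or sline.startswith("customernum:  "):
        break
print pos

# Every later run: jump straight there instead of scanning.
START = 123456789                 # the number printed above (made up here)
txtdb = open(DBFILE, 'r')
txtdb.seek(START)
sline = txtdb.readline()          # the first "customernum:" line

(Sticking to readline() rather than "for sline in txtdb" matters here: in Python 2, file iteration uses an internal read-ahead buffer that makes tell() unreliable.)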

Another, more conventional way would be to read data in larger chunks, a few MB at a time, not just the few bytes of a single line.
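A rough sketch of that chunked scan (the 4 MB block size is arbitrary; Alex Martelli's answer below works this idea out in full):

tag = '\ncustomernum:  '
CHUNK = 4 * 1024 * 1024           # 4 MB at a time, arbitrary
txtdb = open(DBFILE, 'r')
offset = 0                        # absolute offset of the current chunk
tail = ''                         # carry-over in case the tag straddles chunks
while True:
    chunk = txtdb.read(CHUNK)
    if not chunk:
        break                     # hit EOF without finding the tag
    where = (tail + chunk).find(tag)
    if where != -1:
        # reposition at the start of the found line; readline() from here
        txtdb.seek(offset - len(tail) + where + 1)
        break
    offset += len(chunk)
    tail = chunk[-len(tag):]

(Because the tag is searched with a leading newline, this misses a match on the very first line of the file; handle that as a special case.)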

THC4k
I wish; sadly, the top block keeps changing from import to import, so I can't just skip it :-/ Thanks for the idea tho' :-)
Birt
A: 

Tell us more about the file.

Can you use file.seek to do a binary search? Seek to the halfway mark, read a few lines, determine if you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).
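A rough sketch of that bisection, which only works if the records are sorted on the key you search for; is_past_target() is a hypothetical predicate you would have to write from the file's format:

def is_past_target(line):
    # Hypothetical: depends entirely on your format, e.g. parse a
    # customernum out of the line and compare it to the one you want.
    raise NotImplementedError

def find_region(f, lo, hi):
    # Recursively bisect the byte range [lo, hi) of file f. Returns
    # an offset at most 4096 bytes before the target, from which a
    # short linear scan finishes the job.
    if hi - lo < 4096:
        return lo
    mid = (lo + hi) // 2
    f.seek(mid)
    f.readline()                  # skip the (probably partial) line at mid
    line = f.readline()
    if not line or is_past_target(line):
        return find_region(f, lo, mid)
    return find_region(f, mid, hi)

f = open(DBFILE, 'r')
f.seek(0, 2)                      # seek to the end to learn the file size
start = find_region(f, 0, f.tell())
f.seek(start)                     # scan forward line by line from here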

Dustin Getz
The file is ~7 GB (today) and growing. It's pure ASCII; the keywords I look for all end with a ":", but the keywords have different lengths. Between the keywords are lines with uninteresting stuff that end with ":" too.
Birt
Can you tell from the format whether you seeked too far or not far enough? Turning this algorithm into log n will yield an enormous performance benefit.
Dustin Getz
+1  A: 

The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then move on to by-line processing for the rest. It's somewhat finicky and error-prone (off-by-one and the like) so it really needs testing, but the general idea is as follows...:

import itertools

def readloop(DBFILE):
  txtdb=open(DBFILE, 'r')
  tag = "customernum:  "
  BIGBLOCK = 1024 * 1024
  # locate first occurrence of tag at line-start
  # (assumes the VERY FIRST line doesn't start that way,
  # else you need a special-case and slight refactoring)
  blob = ''
  while True:
    newdata = txtdb.read(BIGBLOCK)
    if not newdata:
      # tag not present at all -- warn about that, then
      return
    blob = blob + newdata
    where = blob.find('\n' + tag)
    if where != -1:  # found it!
      blob = blob[where+1:] + txtdb.readline()
      break
    blob = blob[-len(tag):]
  # now make a by-line iterator over the part of interest
  thelines = itertools.chain(blob.splitlines(True), txtdb)
  sline = next(thelines, '')
  while sline.startswith(tag):
    data = []
    data.append(sline)
    sline = next(thelines, '')
    while not sline.startswith(tag):
      data.append(sline)
      sline = next(thelines, '')
      if not sline:
        break
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zip = getitem(data, "zip:  ")

Here, I've tried to keep as much of your structure intact as feasible, doing only minor enhancements beyond the "big idea" of this refactoring.

Alex Martelli
Awesome! Many thanks for your help! :)
Birt
+4  A: 

Please do not write this code:

while condition is False:

Boolean conditions are boolean for cryin' out loud, so they can be tested (or negated and tested) directly:

while not condition:

Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.

Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test:

>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_GLOBAL              1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could cut back on LOAD_GLOBAL by defining a local name for False:

>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_FAST                1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

But what if we just drop the 'is' test completely?:

>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 UNARY_NOT
             13 RETURN_VALUE

We've collapsed a LOAD_xxx and COMPARE_OP into a simple UNARY_NOT. "is False" certainly isn't helping the performance cause any.

Now what if we could do some gross elimination of lines without doing any function calls at all? If the first character of the line is not a 'c', there is no way it will startswith('customernum'). Let's try that:

>>> test = lambda t : t[0] != 'c' or not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_CONST               0 (0)
              6 BINARY_SUBSCR
              7 LOAD_CONST               1 ('c')
             10 COMPARE_OP               3 (!=)
             13 JUMP_IF_TRUE            14 (to 30)
             16 POP_TOP
             17 LOAD_FAST                0 (t)
             20 LOAD_ATTR                0 (startswith)
             23 LOAD_CONST               2 ('customernum')
             26 CALL_FUNCTION            1
             29 UNARY_NOT
        >>   30 RETURN_VALUE

(Note that using [0] to get the first character of a string does not create a slice - this is in fact very fast.)

Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using all fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'" we save ourselves an extraneous UNARY_NOT instruction.

So, using this learning about short-cut optimization, I suggest changing this code:

while sline.startswith("customernum:  ") is False:
    sline = txtdb.readline()

while sline.startswith("customernum:  "):
    ... do the rest of the customer data stuff...

To this:

for sline in txtdb:
    if sline[0] == 'c' and sline.startswith("customernum:  "):
        ... do the rest of the customer data stuff...

Note that I have also removed the .readline() function call, and just iterate over the file using "for sline in txtdb".
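If you want to check these micro-optimizations against your own data, timeit makes the comparison straightforward (a quick sketch; the sample line is made up):

import timeit

# a typical uninteresting line that does not start with 'c'
setup = "t = 'somekeyword:  blah blah blah\\n'"
for test in ("t.startswith('customernum') is False",
             "not t.startswith('customernum')",
             "t[0] != 'c' or not t.startswith('customernum')"):
    print test, timeit.timeit(test, setup)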

I realize Alex has provided a different body of code entirely for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm, before pulling out big but obscure block reading guns.

Paul McGuire
Many thanks for explaining this in detail, much appreciated.
Birt