Ok here is my existing code:

matchedLines = []
for line in datafile:
    splitline = line.split()
    for item in splitline:
        if not item.endswith("JAX"):
            if item.startswith("STF") or item.startswith("BRACKER"):
                matchedLines.append(item)


count = 0
for line in completedataset:
    print count
    count += 1
    for t in matchedLines:
        if t in line[:line.find(',')]:
            line = line.strip().split(',')
            smallerdataset.write(','.join(line[:3]) + '\n')
            break

datafile.close()
completedataset.close()
smallerdataset.close()

I want to make a further optimization. The file is really large. I would like to delete the lines that have already been matched and written to the small file from the big file, to reduce the amount of time it takes to search through it. Any suggestions on how I should go about this?

Thanks Bob.

A: 

You cannot delete lines in a text file - it would require moving all the data after the deleted line up to fill the gap, and would be massively inefficient.

One way to do it is to write the lines you want to keep from bigfile.txt to a temp file, and when you have finished processing, delete bigfile.txt and rename the temp file to replace it.
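A rough sketch of that approach, assuming a hypothetical keep_line() test that decides whether a line has not yet been matched:

import os

def keep_line(line):
    # Placeholder test - substitute whatever marks a line as still unmatched.
    return not line.startswith('MATCHED')

with open('bigfile.txt') as src, open('bigfile.tmp', 'w') as tmp:
    for line in src:
        if keep_line(line):
            tmp.write(line)

os.remove('bigfile.txt')                  # drop the original
os.rename('bigfile.tmp', 'bigfile.txt')   # the temp file takes its place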

Alternatively, if bigfile.txt is small enough to fit in memory, you could read the entire file into a list, delete the lines from the list, and then write the list back to disk.
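The in-memory variant is even shorter - a minimal sketch, again assuming the same hypothetical keep_line() test:

with open('bigfile.txt') as f:
    lines = f.readlines()

# Keep only the lines that have not yet been matched.
lines = [line for line in lines if keep_line(line)]

with open('bigfile.txt', 'w') as f:
    f.writelines(lines)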

I would also guess from your code that bigfile.txt is some sort of CSV file. If so, it may be better to convert it to a database and use SQL to query it. Python comes with the sqlite3 module built in, and there are third-party libraries for most other databases.
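If you went that route, the loading step might look roughly like this (the table and column names are made up for illustration, and the column layout is a guess at your format):

import csv
import sqlite3

conn = sqlite3.connect('bigfile.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (key TEXT, col2 TEXT, col3 TEXT)')

# Load the CSV once, keeping the first three columns of each row.
with open('bigfile.txt') as f:
    rows = (row[:3] for row in csv.reader(f) if len(row) >= 3)
    conn.executemany('INSERT INTO records VALUES (?, ?, ?)', rows)
conn.commit()

# Matching and deleting then become single SQL statements, e.g.
# conn.execute('DELETE FROM records WHERE key = ?', (some_key,))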

Dave Kirby
A: 

As I said in a comment, it doesn't look to me like the size of "bigfile" should be slowing down the speed at which the count increments. When you iterate over a file like that, Python just reads one line at a time in order.

The optimizations you can do at this point depend on how big matchedLines is, and what relationship the matchedLines strings have to the text you're looking in.

If matchedLines is big, you could save time by only doing the 'find' once:

for line in completedataset:
    text = line[:line.find(',')]
    for t in matchedLines:
        if t in text:
            line = line.strip().split(',')
            smallerdataset.write(','.join(line[:3]) + '\n')
            break

In my tests, the 'find' took about 300 nanoseconds, so if matchedLines is a few million items long, there's your extra second right there.
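If you want to check that figure on your own data, a quick and rough way is the timeit module, here with a made-up sample line:

import timeit

# Roughly time one slice-up-to-the-comma operation on a sample line.
setup = "line = 'STF1234,foo,bar,baz'"
print(timeit.timeit("line[:line.find(',')]", setup=setup, number=1000000))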

If you're looking for exact matches, not substring matches, you can speed things WAY up by using a set:

matchedLines = set(matchedLines)
for line in completedataset:
    target = line[:line.find(',')]
    ## One lookup and you're done!
    if target in matchedLines:
        line = line.strip().split(',')
        smallerdataset.write(','.join(line[:3]) + '\n') 

If the target texts that don't match tend to look completely different from ones that do (for example, most of the targets are random strings, and matchedLines is a bunch of names) AND the matchedLines are all above a certain length, you could try to get really clever by checking substrings. Suppose all matchedLines are at least 5 characters long...

def subkeys(s):
    ## e.g. if len(s) is 7, return s[0:5], s[1:6], s[2:7].
    return [s[i:i+5] for i in range(len(s) + 1 - 5)]

existing_subkeys = set()
for line in matchedLines:
    existing_subkeys.update(subkeys(line))

for line in completedataset:
    target = line[:line.find(',')]
    might_match = False
    for subkey in subkeys(target):
        if subkey in existing_subkeys:
            might_match = True
            break
    if might_match:
        # Then we have to do the old slow way.
        for matchedLine in matchedLines:
            if matchedLine in target:
                # Do the split and write as before.
                line = line.strip().split(',')
                smallerdataset.write(','.join(line[:3]) + '\n')
                break

But it's really easy to outsmart yourself trying to do things like that, and it depends on what your data looks like.

fholo