Hi,

I am reading from several files; each file is divided into two pieces, first a header section of a few thousand lines followed by a body of a few thousand. My problem is that I need to concatenate these files into one file where all the headers are at the top, followed by the bodies.

Currently I am using two loops: one to pull out all the headers and write them, and a second to write the body of each file (I also include a tmp_count variable to limit the number of lines loaded into memory before dumping to file).

This is pretty slow: about 6 minutes for a 13 GB file. Can anyone tell me how to optimize this, or if there is a faster way to do this in Python?

Thanks!

Here is my code:

import os

def cat_files_sam(final_file_name, work_directory_master, file_count):

    final_file = open(final_file_name, "w")

    if len(file_count) > 1:
        file_count = sort_output_files(file_count)  # defined elsewhere

    # first pass: only the @ header lines
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master, bowtie_file)):
            if line.startswith("@"):
                if tmp_count == 1000000:
                    final_file.writelines(tmp_list)
                    tmp_list = []
                    tmp_count = 0
                tmp_list.append(line)
                tmp_count += 1
            else:
                final_file.writelines(tmp_list)
                break

    # second pass: the body lines (everything after the header)
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master, bowtie_file)):
            if line.startswith("@"):
                continue
            if tmp_count == 1000000:
                final_file.writelines(tmp_list)
                tmp_list = []
                tmp_count = 0
            tmp_list.append(line)
            tmp_count += 1
        final_file.writelines(tmp_list)

    final_file.close()
+1  A: 

How fast would you expect it to be to move 13 GB of data around? This problem is I/O bound, not a problem with Python. To make it faster, do less I/O. Which means that you are either (a) stuck with the speed you've got or (b) should retool later elements of your toolchain to handle the files in place rather than requiring one giant 13 GB file.

pboothe
"How fast would you expect": per http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate , 70 MB/sec (disk-to-memory -). The approach I suggest may reduce the I/O (not reading the files' headers twice).
Alex Martelli
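As a rough sanity check against the roughly 6 minutes reported in the question (a back-of-the-envelope sketch, assuming the 13 GB is read once and written once through the same disk, at the ~70 MB/sec quoted above):

gigabytes_moved = 13 + 13                # read the inputs, write the output
seconds = gigabytes_moved * 1024 / 70.0  # about 380 s
print(seconds / 60)                      # roughly 6.3 minutes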
But if you are not reading from one disk and writing to a different one (and I bet you aren't), then this particular program is actually running almost optimally! :)
pboothe
+2  A: 

You can save the time the second pass spends skipping the headers, as long as you have a reasonable amount of spare disk space: as well as the final file, also open a temporary file temp_file (for 'w+'), and do:

import shutil
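# final_file ('w') and temp_file ('w+') are assumed to be open already, as described above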

hdr_list = []
bod_list = []
dispatch = {True: (hdr_list, final_file), 
            False: (bod_list, temp_file)}

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master,bowtie_file)) as f:
        for line in f:
            L, fou = dispatch[line[0]=='@']
            L.append(line)
            if len(L) == 1000000:
                fou.writelines(L)
                del L[:]

# write final parts, if any
for L, fou in dispatch.values():
    if L: fou.writelines(L)

temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)

This should enhance your program's performance. Fine-tuning that now-hard-coded 1000000, or even doing away with the lists entirely and writing each line directly to the appropriate file (final or temporary), is another option you should benchmark (but if you have plenty of memory, I expect it won't matter much; however, intuitions about performance are often misleading, so it's best to try and measure!).
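For what it's worth, a minimal sketch of that last variant, writing each line straight to the appropriate file and relying on the library's own buffering (final_file and temp_file are assumed to be open as described above):

import os
import shutil

# dispatch each line directly to the right (buffered) file object;
# no intermediate lists and no hand-tuned flush threshold
dispatch = {True: final_file, False: temp_file}

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master, bowtie_file)) as f:
        for line in f:
            dispatch[line.startswith("@")].write(line)

# append the buffered-up bodies after all the headers
temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)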

Alex Martelli
@Alex: need to write the last block as well after the loop
van
give a man a fish... ;)
msw
@van, you're right -- adding that.
Alex Martelli
@msw, doesn't "smell" like homework, so showing the start of an optimization seems OK to me -- it suggests idioms such as dispatching by dict, and useful stdlib modules such as `shutil`, so I hope it can be instructive as well as rapidly useful.
Alex Martelli
@alex, agreed in full, I'm sorry if my joke appeared critical as that wasn't intended.
msw
@Alex thank you for this example! @msw I am interested in how to optimize this process, as I do this (and many related problems) on a daily basis. Although I am still a junior programmer, I enjoy the chance to make my code work as efficiently as possible.
wemmett
A: 

There are two gross inefficiencies in the code you meant to write (which is not the code presented):

  1. You are building up huge lists of header lines in the first major for block instead of just writing them out.
  2. You are skipping the headers of the files line by line again in the second major for block, even though you have already determined where the headers end in (1). See file.seek and file.tell; a sketch of this idea follows.
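A minimal sketch of that seek/tell idea, assuming (as in the question) that each file's header is a contiguous block of @ lines at the top; the function name is just illustrative, and readline() is used instead of "for line in f" so that tell() stays valid:

import os
import shutil

def cat_headers_then_bodies(final_file_name, work_directory_master, file_count):
    final_file = open(final_file_name, "w")
    body_offsets = {}

    # pass 1: copy the @ header lines, remembering where each body starts
    for bowtie_file in file_count:
        with open(os.path.join(work_directory_master, bowtie_file)) as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line.startswith("@"):
                    body_offsets[bowtie_file] = pos
                    break
                final_file.write(line)

    # pass 2: seek straight past each header and copy the body wholesale
    for bowtie_file in file_count:
        with open(os.path.join(work_directory_master, bowtie_file)) as f:
            f.seek(body_offsets[bowtie_file])
            shutil.copyfileobj(f, final_file)

    final_file.close()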
msw
Thanks for the advice! I had a conversation yesterday about this with a more senior programmer and have implemented seek() and tell() in my code (I have only seen these posts now, so I haven't implemented any other improvements), but this has had a very small impact overall and it still takes roughly 6 minutes. I was unaware that building lists in memory is slower than writing every line individually; I thought the I/O overhead of writing each line would be higher than writing a whole chunk at once. I am just curious why this is the case.
wemmett
You are going to have to do the same amount of I/O eventually whether you build the list in memory first or just emit it line by line as you read it. Things you don't do (list construction, in this case) take zero time. The library is (silently) buffering your reads and writes, so not every readline() (for line in f) or write() actually makes a disk access.
msw
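For instance, the file object's own buffer size can even be set explicitly when opening the file, replacing the hand-rolled tmp_list / tmp_count batching; a tiny sketch, reusing names from the question (the 1 MB value is purely illustrative):

# the optional third argument to open() is the buffer size in bytes, so
# writing line by line does not mean one disk access per line
final_file = open(final_file_name, "w", 1024 * 1024)
for line in open(os.path.join(work_directory_master, bowtie_file)):
    final_file.write(line)
final_file.close()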
That said, Alex Martelli's approach may be far more efficient and far more 'Pythonic' than the smaller optimizations I mentioned. I say "may be" as I've not measured it, and in any language guesses about efficiency are often incorrect.
msw