ansaurus

Question

Optimizing find and replace over large files in Python

Answer 1

A:

Open the files read/write ('r+') and avoid the double open/close (and the likely associated buffer flush). Also, if possible, don't write back the entire file; seek and write back only the changed areas after doing the replace on the file's contents. Read, replace, write changed areas (if any).

That still won't help performance too much though: I'd profile and determine where the performance hit actually is, and then move on to optimising it. It could just be the reading of the data from disk that's very slow, and there's not much you can do about that in Python.
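
A minimal sketch of the "seek and write back only the changed areas" idea, under one loud assumption: the replacement has exactly the same byte length as the text it replaces. If the lengths differ, everything after the first change shifts and the rest of the file has to be rewritten anyway. The function name `patch_in_place` is made up for illustration.

```python
def patch_in_place(path, old, new):
    """Overwrite every occurrence of `old` with `new` without rewriting the file.

    Assumption: `old` and `new` are bytes objects of equal length.
    Returns the number of occurrences patched.
    """
    if not old or len(old) != len(new):
        raise ValueError("old and new must be non-empty and the same length")
    count = 0
    with open(path, "r+b") as f:
        data = f.read()              # read everything first, as suggested above
        pos = data.find(old)
        while pos != -1:
            f.seek(pos)              # jump to the changed area only
            f.write(new)
            count += 1
            pos = data.find(old, pos + len(old))
    return count
```

Because the writes never change the file's length, the offsets found in the original `data` stay valid while patching.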

Matthew Iselin 2010-09-26 22:49:18

'rw' is not 'read/write'. It's just 'read', as the 'w' is completely ignored. The modes for 'read/write' are 'r+', 'w+' and 'a+', where each does something subtly different. Rewriting a file as you read it is tricky, as you need to seek in between reads and writes and you need to be careful not to overwrite what you haven't read yet.
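
To make the "subtly different" part concrete, here is a small sketch that writes the same two characters through each of the three modes. It assumes Python 3, where `open(path, 'rw')` is not silently accepted but raises `ValueError`. The helper name `demo_modes` is hypothetical.

```python
import os
import tempfile

def demo_modes():
    """Return the file contents after writing "AB" in each read/write mode."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    results = {}
    try:
        for mode in ("r+", "w+", "a+"):
            with open(path, "w") as f:
                f.write("0123456789")    # reset to known contents
            with open(path, mode) as f:
                f.write("AB")
            with open(path) as f:
                results[mode] = f.read()
    finally:
        os.remove(path)
    return results

# 'r+' overwrites from the start: "AB23456789"
# 'w+' truncates first:           "AB"
# 'a+' always appends:            "0123456789AB"
```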

Thomas Wouters 2010-09-26 22:51:22

@Thomas: Ah, yes. Always get caught out on the open() flags. Too much C :). Anyway, my suggestion was to read the file completely first and then only write back changes, not to write back changes while reading.

Matthew Iselin 2010-09-26 22:54:57

The string you pass to open() is actually what you'd pass to `fopen()` in C (and why it has such sucky semantics) so "too much C" is hardly an excuse :-)

Thomas Wouters 2010-09-26 23:17:20

@Thomas: I'm thinking more like a direct translation from O_RDWR with `open()` (which I use far more than `fopen()` in the environment I write code in). Also, some platforms accept "rw" to `fopen()` too - it's just not standardised.

Matthew Iselin 2010-09-27 01:06:55

Answer 2

+1 A:

A few things (unrelated to the optimization problem):

dir + file should be os.path.join(dir, file)

You might want to not reuse infile, but instead open (and write to) a separate outfile. This also won't increase performance, but is good practice.

I don't know if you're I/O bound or cpu bound, but if your cpu utilization is very high, you may want to use threading, with each thread operating on a different file (so with a quad core processor, you'd be reading/writing 4 different files simultaneously).

babbitt 2010-09-26 23:00:37

You have the threading advice completely backwards. In Python, thread to get around IO bounds. This is due to the Global Interpreter Lock. You use subprocesses for CPU/memory bounded applications which is what this is. (only 50 IO operations in a week ;)
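
A rough sketch of the subprocess approach, one file per worker process, using the standard `multiprocessing` module. It assumes Python 3, and the `REPLACEMENTS` table and function names are placeholders, not anything from the question. Whether it helps at all depends on the job really being CPU-bound rather than disk-bound.

```python
import os
from multiprocessing import Pool

# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {"colour": "color"}

def process_file(path):
    """Rewrite one file line by line; runs inside a worker process."""
    tmp = path + ".new"
    with open(path, encoding="utf-8", newline="") as src, \
         open(tmp, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            for old, new in REPLACEMENTS.items():
                line = line.replace(old, new)
            dst.write(line)
    os.replace(tmp, path)    # swap the rewritten file into place
    return path

def process_all(paths, workers=4):
    """Spread the files over `workers` processes, sidestepping the GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(process_file, paths)
```

On platforms that spawn rather than fork worker processes (Windows, for example), `process_all` should be called from under an `if __name__ == "__main__":` guard.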

aaronasterling 2010-09-26 23:02:50

Good point. I knew about the global lock, but didn't actually think about subprocesses vs. threads. Learning something new every day.

babbitt 2010-09-27 00:06:49

@AaronMcSmooth: I would expect this to be I/O bound, since searching for a string and replacing it from a dictionary is pretty low-effort for a modern processor. But in this case multithreading won't help unless some of the files are on separate physical disks or it's possible to locate the translated files on a different physical disk.

intuited 2010-09-27 00:22:42

Answer 3

+7 A:

In your current code, you're reading the whole file into memory at once. Since they're 500Mb files, that means 500Mb strings. And then you do repeated replacements of them, which means Python has to create a new 500Mb string with the first replacement, then destroy the first string, then create a second 500Mb string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.

If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:

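import codecs
import os

for file in files:
    fname = os.path.join(dir, file)
    # Write to a separate ".new" file so the original stays intact until the end.
    with codecs.open(fname, "r", "utf-8") as inFile:
        with codecs.open(fname + ".new", "w", "utf-8") as outFile:
            for line in inFile:
                # do_replacements_on() stands for whatever replacement code you use.
                outFile.write(do_replacements_on(line))
    # Note: on Windows, os.rename() fails if the destination already exists.
    os.rename(fname + ".new", fname)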

If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.
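
As an illustration of the block-wise approach, here is a minimal sketch for a single literal search string, assuming Python 3 text-mode file objects. The function name `replace_in_blocks` and its parameters are invented for this example. The trick is to hold back the last `len(old) - 1` characters of each block, since they might be the start of a match that finishes in the next block.

```python
def replace_in_blocks(src, dst, old, new, blocksize=1 << 20):
    """Copy src to dst, replacing `old` with `new`, reading in blocks.

    Matches that straddle a block boundary are handled by carrying
    the unprocessed tail of each block over to the next read.
    """
    if not old:
        raise ValueError("old must not be empty")
    keep = len(old) - 1          # longest possible partial match
    buf = ""
    while True:
        block = src.read(blocksize)
        if not block:
            break
        buf += block
        out = []
        pos = 0
        while True:
            hit = buf.find(old, pos)
            if hit == -1:
                break
            out.append(buf[pos:hit])
            out.append(new)
            pos = hit + len(old)
        # Text before `safe` can no longer be part of a future match.
        safe = max(pos, len(buf) - keep)
        out.append(buf[pos:safe])
        dst.write("".join(out))
        buf = buf[safe:]
    dst.write(buf)               # leftover tail is too short to contain a match
```

Extending this to many search strings at once means holding back the length of the longest one minus one, and is correspondingly fiddlier.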

Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:

>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'

For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the replacement argument, which will then be called for each match:

>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'sausages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
...     return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, sausages and vikings'
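
One caveat with the pattern above: regex alternation takes the first alternative that matches, so if one key is a prefix of another (say 'spam' and 'spammer'), the shorter one can win depending on dict order. A possible way around that, sketched here in Python 3 with a made-up helper name `multi_replace`, is to sort the keys longest first before joining them:

```python
import re

def multi_replace(text, mapping):
    """Replace every key of `mapping` found in `text` in a single pass."""
    if not mapping:
        return text
    # Longest keys first, so 'spammer' is tried before 'spam'.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

For example, `multi_replace("spammer spam", {"spam": "X", "spammer": "Y"})` gives `'Y X'`. Replacement text is not rescanned, so a replacement that itself contains a key is left alone.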

Thomas Wouters 2010-09-26 23:11:22

I'd forgotten that strings are immutable. Much nicer than my answer.

aaronasterling 2010-09-26 23:25:16

I was going to add to your answer that the 500Mb string isn't just a matter of fitting into RAM or pushing into swap, but also of how most architectures deal better with repeated operations on a smaller set of data (something that fits into the CPU caches well, although Python quickly fills the cache with its own stuff.) On top of that, Python also optimizes allocations of smaller objects more than of large ones, which matters in particular on Windows (but all platforms benefit from it to some degree.)

Thomas Wouters 2010-09-26 23:29:32

Locating the output files on a different physical disk will likely make the overall procedure run faster, since the bottleneck will be in reading from and writing to disk. You could probably further improve performance by doing the writes in a separate thread and passing each line to it through a `Queue.Queue`. I think the usefulness of this last measure would depend on the effectiveness of the reading drive's readahead cache in combination with any write caching on the writing drive. But that's also maybe a bit too heavy for a Python beginner.

intuited 2010-09-26 23:53:04

Threads won't do anything significant; any benefit from parallelized reads or writes is greatly negated by all the overhead, which is buckets. Writing to a different spindle probably would matter a bit, but it would mean you can't do the `os.rename()` at the end.

Thomas Wouters 2010-09-27 00:04:55

+1 for translate, +1 for regex and +1 for reading chunks .. if i could.

THC4k 2010-09-27 00:08:41

@Thomas Wouters: Sorry, I'm not sure what you mean by "spindle". If by "different spindle" you mean "different hard drive", then yes, that's what I was suggesting. My understanding of the way that Python works is that unless the reading and writing is happening in separate threads, it won't be able to both read and write at the same time. This would basically (IIUC) negate the benefits of writing to a separate drive unless the drives are able to simultaneously read and write anyway to/from their caches.

intuited 2010-09-27 00:18:38

@intuited: Python doesn't do much in terms of delayed writes, but neither would threads. As I said, the overhead of threads would dwarf any benefits you might wring out of the OS. Intelligent use of the drives is all down to the OS buffering and caching, which most do rather aggressively.

Thomas Wouters 2010-09-27 00:26:17

Answer 4

+2 A:

I think you can lower memory use greatly (and thus limit swap use and make things faster) by reading a line at a time and writing it (after the regexp replacements already suggested) to a temporary file - then moving the file to replace the original.
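
A sketch of that temporary-file approach, assuming Python 3.3+ (for `os.replace`). The name `rewrite_lines` and the `transform` callable are hypothetical; `transform` stands in for whatever per-line replacement you settle on. The temporary file is created in the same directory as the original so the final move stays on one filesystem.

```python
import os
import tempfile

def rewrite_lines(path, transform, encoding="utf-8"):
    """Stream `path` through `transform` line by line, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # newline="" keeps the original line endings untouched.
        with open(fd, "w", encoding=encoding, newline="") as dst:
            with open(path, "r", encoding=encoding, newline="") as src:
                for line in src:
                    dst.write(transform(line))
        os.replace(tmp_path, path)   # replaces the original in one step
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Only one line is held in memory at a time. Note that the new file gets the temporary file's permissions, which may differ from the original's.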

Radomir Dopieralski 2010-09-26 23:15:37
