I am a complete beginner to Python, or any serious programming language for that matter. I finally got a prototype to work, but I think it will be too slow.

My goal is to find and replace some Chinese characters across all files (they are CSV) in a directory with integers, as per a CSV file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.

I will be looping across about 25 files that are in the neighborhood of 500 MB each (and about a million lines). The dictionary I will be using will have about 300 elements, and I will be changing unicode (Chinese characters) to integers. I tried a test run and, assuming everything scales up linearly (?), it looks like it would take about a week for this to run.

Thanks in advance. Here is my code (don't laugh!):

# -*- coding: utf-8 -*-

import os, codecs

dir = "C:/Users/Roy/Desktop/test/"

Dict = {'hello' : 'good', 'world' : 'bad'}  # stand-in for the real ~300-entry character-to-integer mapping

for dirs, subdirs, files in os.walk(dir):
    for file in files:
        inFile = codecs.open(dir + file, "r", "utf-8")
        inFileStr = inFile.read()
        inFile.close()
        inFile = codecs.open(dir + file, "w", "utf-8")
        for key in Dict:
            inFileStr = inFileStr.replace(key, Dict[key])
        inFile.write(inFileStr)
        inFile.close()
A: 

Open the files read/write ('r+') and avoid the double open/close (and the likely associated buffer flush). Also, if possible, don't write back the entire file; seek and write back only the changed areas after doing the replace on the file's contents. Read, replace, write changed areas (if any).
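For illustration, here's a minimal sketch of the single-open idea applied to the question's loop: open once with 'r+', read everything, replace, then seek back and rewrite in place. It still rewrites the whole contents rather than only the changed spans, and path is just a stand-in for the file path built in the question's loop; truncate() guards against the new contents being shorter than the old.

import codecs

f = codecs.open(path, "r+", "utf-8")
text = f.read()
for key in Dict:
    text = text.replace(key, Dict[key])
f.seek(0)       # jump back to the start of the file
f.write(text)   # overwrite with the replaced contents
f.truncate()    # drop leftover bytes if the new contents are shorter
f.close()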

That still won't help performance too much, though: I'd profile, determine where the performance hit actually is, and then move on to optimising it. It could just be that reading the data from disk is very slow, and there's not much you can do about that in Python.
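For the profiling step, the stdlib profiler is a quick way to see where the time goes; here process_files is a hypothetical function wrapping the question's whole loop:

import cProfile

cProfile.run("process_files()", sort="cumulative")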

Matthew Iselin
'rw' is not 'read/write'. It's just 'read', as the 'w' is completely ignored. The modes for 'read/write' are 'r+', 'w+' and 'a+', and each does something subtly different. Rewriting a file as you read it is tricky, as you need to seek in between reads and writes, and you need to be careful not to overwrite what you haven't read yet.
Thomas Wouters
@Thomas: Ah, yes. Always get caught out on the open() flags. Too much C :). Anyway, my suggestion was to read the file completely first and then only write back changes, not to write back changes while reading.
Matthew Iselin
The string you pass to open() is actually what you'd pass to `fopen()` in C (which is why it has such sucky semantics), so "too much C" is hardly an excuse :-)
Thomas Wouters
@Thomas: I'm thinking more like a direct translation from O_RDWR with `open()` (which I use far more than `fopen()` in the environment I write code in). Also, some platforms accept "rw" to `fopen()` too - it's just not standardised.
Matthew Iselin
+1  A: 

A few things (unrelated to the optimization problem):

dir + file should be os.path.join(dir, file)

You might not want to reuse inFile, but instead open (and write to) a separate outFile. This also won't increase performance, but it is good practice.

I don't know if you're I/O bound or CPU bound, but if your CPU utilization is very high, you may want to parallelize the work, with each worker operating on a different file (so with a quad-core processor, you'd be reading/writing 4 different files simultaneously).
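For what it's worth, here is a rough sketch of spreading the files across cores. As the comments below point out, worker processes (not threads) are what actually help CPU-bound work in CPython because of the Global Interpreter Lock, so this uses multiprocessing; process_one_file is a hypothetical per-file function doing the read/replace/write.

import multiprocessing
import os

def process_one_file(fname):
    # hypothetical: read fname, do the replacements, write the result back
    pass

if __name__ == "__main__":
    dir = "C:/Users/Roy/Desktop/test/"
    fnames = [os.path.join(dir, f) for f in os.listdir(dir)]
    pool = multiprocessing.Pool(processes=4)  # roughly one worker per core
    pool.map(process_one_file, fnames)
    pool.close()
    pool.join()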

babbitt
You have the threading advice completely backwards. In Python, you thread to get around I/O bounds; this is due to the Global Interpreter Lock. You use subprocesses for CPU/memory-bound applications, which is what this is. (only 50 I/O operations in a week ;)
aaronasterling
Good point. I knew about the global lock, but didn't actually think about subprocesses vs. threads. Learning something new every day.
babbitt
@AaronMcSmooth: I would expect this to be I/O bound, since searching for a string and replacing it from a dictionary is pretty low-effort for a modern processor. But in this case multithreading won't help unless some of the files are on separate physical disks or it's possible to locate the translated files on a different physical disk.
intuited
+7  A: 

In your current code, you're reading the whole file into memory at once. Since they're 500 MB files, that means 500 MB strings. And then you do repeated replacements on them, which means Python has to create a new 500 MB string for the first replacement, then destroy the first string, then create a second 500 MB string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.

If you know the replacements will always be contained within a line, you can read the file line by line by iterating over it. Python will buffer the reads, which means they will be fairly well optimized. You should open a new file, under a new name, and write the new file simultaneously. Perform the replacement on each line in turn and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of copying back and forth as you do the replacements:

for file in files:
    fname = os.path.join(dir, file)
    inFile = codecs.open(fname, "r", "utf-8")
    outFile = codecs.open(fname + ".new", "w", "utf-8")
    for line in inFile:
        newline = do_replacements_on(line)
        outFile.write(newline)
    inFile.close()
    outFile.close()
    os.rename(fname + ".new", fname)

If you can't be certain that they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of a block. Not as easy to do, but usually still worth it to avoid the 500 MB strings.
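Here is a hedged sketch of the block approach that sidesteps the partial-match bookkeeping by extending each block to the end of its current line; this is safe on the assumption that no replacement target contains a newline, which holds for CSV fields. fname and do_replacements_on are the same names as in the snippet above.

blocksize = 1024 * 1024  # read roughly this much at a time
inFile = codecs.open(fname, "r", "utf-8")
outFile = codecs.open(fname + ".new", "w", "utf-8")
while True:
    block = inFile.read(blocksize)
    if not block:
        break
    # finish the current line so no CSV field is split across two blocks
    block += inFile.readline()
    outFile.write(do_replacements_on(block))
inFile.close()
outFile.close()
os.rename(fname + ".new", fname)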

Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:

>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
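Since the question's targets are single Chinese characters being mapped to integers, the translate table can be built straight from the existing mapping. A small sketch in the thread's Python 2 idiom, assuming (as in the real data, unlike the toy Dict above) that every key is a single character:

# build the codepoint -> replacement-string table once, before the loop
table = dict((ord(k), unicode(v)) for k, v in Dict.items())

# then, inside the per-line loop shown earlier:
newline = line.translate(table)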

For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the replacement argument, which will then be called for each match:

>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'sausages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
...     return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, sausages and vikings'
Thomas Wouters
I'd forgotten about immutable strings. Much nicer than my answer.
aaronasterling
I was going to add to your answer that the 500 MB string isn't just a matter of fitting into RAM or pushing into swap, but also of how most architectures deal better with repeated operations on a smaller set of data (something that fits into the CPU caches well, although Python quickly fills the cache with its own stuff). On top of that, Python also optimizes allocations of smaller objects more than of large ones, which matters in particular on Windows (but all platforms benefit from it to some degree).
Thomas Wouters
Locating the output files on a different physical disk will likely make the overall procedure run faster, since the bottleneck will be in reading from and writing to disk. You could probably further improve performance by doing the writes in a separate thread and passing each line to it through a `Queue.Queue`. I think the usefulness of this last measure would depend on the effectiveness of the reading drive's readahead cache in combination with any write caching on the writing drive. But that's also maybe a bit too heavy for a Python beginner.
intuited
Threads won't do anything significant; any benefit from parallelized reads or writes is greatly outweighed by all the overhead, which is buckets. Writing to a different spindle probably would matter a bit, but it would mean you can't do the `os.rename()` at the end.
Thomas Wouters
+1 for translate, +1 for regex and +1 for reading chunks .. if i could.
THC4k
@Thomas Wouters: Sorry, I'm not sure what you mean by "spindle". If by "different spindle" you mean "different hard drive", then yes, that's what I was suggesting. My understanding of the way that Python works is that unless the reading and writing is happening in separate threads, it won't be able to both read and write at the same time. This would basically (IIUC) negate the benefits of writing to a separate drive unless the drives are able to simultaneously read and write anyway to/from their caches.
intuited
@intuited: Python doesn't do much in terms of delayed writes, but neither would threads. As I said, the overhead of threads would dwarf any benefits you might wring out of the OS. Intelligent use of the drives is all down to the OS's buffering and caching, which most do rather aggressively.
Thomas Wouters
+2  A: 

I think you can lower memory use greatly (and thus limit swap use and make things faster) by reading a line at a time and writing it (after the regexp replacements already suggested) to a temporary file, then moving that file to replace the original.
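A minimal sketch of that, using the standard library's tempfile and shutil; fname and do_replacements_on are the hypothetical names from the earlier line-by-line answer, and the temporary file is created in the same directory as the original.

import codecs
import os
import shutil
import tempfile

tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(fname))
os.close(tmp_fd)  # reopen it through codecs to get transparent UTF-8 handling
inFile = codecs.open(fname, "r", "utf-8")
outFile = codecs.open(tmp_path, "w", "utf-8")
for line in inFile:
    outFile.write(do_replacements_on(line))
inFile.close()
outFile.close()
shutil.move(tmp_path, fname)  # replace the original with the processed copy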

Radomir Dopieralski