views: 1264
answers: 4

I am converting some code from another language to Python. That code reads a rather large file into a string and then manipulates it by array indexing, like:

str[i] = 'e'

This does not work directly in Python because strings are immutable. What is the preferred way of doing this in Python?

I have seen the string.replace() function, but it returns a copy of the string, which does not sound very optimal since the string in this case is an entire file.

+10  A: 
l = list(str)
l[i] = 'e'
str = ''.join(l)
Can Berk Güder
Looks nice but will it work with a huge file?
theycallmemorty
@theycallmemorty: it consumes twice as much memory as it would in C, but other than that, I can't see any reason why it shouldn't work.
Can Berk Güder
In fact, if there's a lot of such manipulation being done, it's probably best to keep the strings as lists of characters.
Lars Wirzenius
This works and seems to be slightly faster than the array approach from another answer. However, both methods are a lot slower than my previous code: currently ~7 seconds vs 0.4 seconds.
Zitrax
@liw.fi: correct. The ''.join(l) line should be used after all character-based modifications are done.
Can Berk Güder
@Zitrax: what's your previous code? Python or the original language (C?). also, see my reply to liw.fi's comment.
Can Berk Güder
Wow, I'm surprised array is so much slower. A list will use a lot more memory since it creates an object per character. Does mmap work any faster? (Also, don't call your variables 'str', that's the name of the string data type!)
Nicholas Riley
@CBG: The previous code is Pike. I am not joining until done.
Zitrax
@Nicholas: sorry if I was not clear, the difference between array and list was just about 0.1 s; the big difference was versus the Pike version of this code.
Zitrax
@Zitrax: I haven't used Pike, but an order of magnitude doesn't sound realistic between two interpreted languages. Besides, Python is usually much faster than Ruby, etc.
Can Berk Güder
I'm not saying it's not possible, but there might be another bottleneck somewhere else. I use Python to parse and analyze 500 MB trace files, and it's pretty fast (~30 secs).
Can Berk Güder
Agreed - take a look at my other answer. I was able to trivially process a 5 MB file in about a second on a few-year-old laptop.
Nicholas Riley
Found the problem. I am new to Python, so I did not realize that my for loops using range() caused a lot of overhead by actually creating long lists. Using while loops instead reduced the time to about the same as the Pike script.
Zitrax
@Zitrax: you can use xrange, too.
Can Berk Güder
Oh, and I'm glad the problem is solved.
Can Berk Güder
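
The range()-vs-xrange() issue that Zitrax hit deserves a quick illustration. In Python 2, range(n) builds an n-element list on every call, while xrange(n) yields indices lazily; in Python 3 this distinction disappears because range() is already lazy. Below is a minimal sketch of the loop styles discussed in the comments above (Python 2 syntax; the data variable is just a stand-in for the file contents):

# Python 2: three ways to index over a large mutable sequence.
data = list('contents of a large file')   # stand-in for the real data

# range() materializes a full list of indices on every call -- costly
# when the sequence is millions of characters long.
for i in range(len(data)):
    data[i] = 'q'

# xrange() produces the indices lazily, avoiding the temporary list.
for i in xrange(len(data)):
    data[i] = 'q'

# A plain while loop also avoids range(), which is what Zitrax switched to.
i = 0
while i < len(data):
    data[i] = 'q'
    i += 1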
+13  A: 

Assuming you're not using a variable-length text encoding such as UTF-8, you can use array.array:

>>> import array
>>> a = array.array('c', 'foo')
>>> a[1] = 'e'
>>> a
array('c', 'feo')
>>> a.tostring()
'feo'

But since you're dealing with the contents of a file, mmap should be more efficient:

>>> f = open('foo', 'r+')
>>> import mmap
>>> m = mmap.mmap(f.fileno(), 0)
>>> m[:]
'foo\n'
>>> m[1] = 'e'
>>> m[:]
'feo\n'
>>> exit()
% cat foo
feo

Here's a quick benchmarking script (you'll need to replace dd with something else for non-Unix OSes):

import os, time, array, mmap

def modify(s):
    # Overwrite every character; works on any mutable sequence of characters.
    for i in xrange(len(s)):
        s[i] = 'q'

def measure(func):
    start = time.time()
    func(open('foo', 'r+'))
    print func.func_name, time.time() - start

def do_split(f):
    # Read into a list of one-character strings, modify, then rejoin.
    l = list(f.read())
    modify(l)
    return ''.join(l)

def do_array(f):
    # Read into a mutable character array ('c' typecode, Python 2).
    a = array.array('c', f.read())
    modify(a)
    return a.tostring()

def do_mmap(f):
    # Memory-map the file and modify it in place.
    m = mmap.mmap(f.fileno(), 0)
    modify(m)

# Create a 5 MB test file (BSD dd syntax; GNU dd on Linux expects bs=1M).
os.system('dd if=/dev/random of=foo bs=1m count=5')

measure(do_mmap)
measure(do_array)
measure(do_split)

Output I got on my several-year-old laptop matches my intuition:

5+0 records in
5+0 records out
5242880 bytes transferred in 0.710966 secs (7374304 bytes/sec)
do_mmap 1.00865888596
do_array 1.09792494774
do_split 1.20163106918

So mmap is slightly faster, but the suggested solutions don't differ much from one another. If you're seeing a huge difference, try using cProfile to see where the time is going.
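
A minimal sketch of what that could look like here (assuming Python 2 and the do_split function and foo file from the benchmark above):

import cProfile

# Run one benchmark function under the profiler; the report lists per-function
# call counts and times, which is how overhead like range() shows up.
cProfile.run("do_split(open('foo', 'r+'))")

# Whole scripts can also be profiled from the shell, e.g.:
#   python -m cProfile your_script.py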

Nicholas Riley
I seem to recall that mmap is Linux-only, so you could face portability problems.
Stefano Borini
Nope, it works on Unix and Windows (http://docs.python.org/library/mmap.html). There are some minor API differences but nothing that affects this use case. Actually a bigger difference on Windows: do_mmap 0.65700006485; do_array 1.0150001049; do_split 0.827999830246.
Nicholas Riley
Thanks for the tip about cProfile; it pointed me to the problem. The for loops used range(), which caused a lot of overhead. I switched to while loops and now the performance is good.
Zitrax
Cool! Glad you figured it out.
Nicholas Riley
+1  A: 

Try:

sl = list(s)
sl[i] = 'e'
s = ''.join(sl)
vartec
+1  A: 

Others have answered the string manipulation part of your question, but I think you ought to consider whether it would be better to parse the file into the data structure the text represents and modify that, rather than manipulating the text directly.
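
A quick sketch of what that could mean in practice. The question doesn't say what format the file has, so the whitespace-separated-fields layout below is purely hypothetical; the point is that you edit a parsed structure and only serialize back to text once:

# Hypothetical sketch: assumes one record per line with whitespace-separated
# fields. The real file format from the question is unknown.

def load(path):
    with open(path) as f:
        return [line.split() for line in f]   # list of field lists

def save(records, path):
    with open(path, 'w') as f:
        for fields in records:
            f.write(' '.join(fields) + '\n')

records = load('data.txt')        # 'data.txt' is a made-up filename
records[0][0] = 'edited'          # modify the structure, not the raw text
save(records, 'data.txt')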

Chris Upchurch