views: 1264
answers: 4

I am converting some code from another language to Python. That code reads a rather large file into a string and then manipulates it by array indexing, like:

str[i] = 'e'

This does not work directly in Python because strings are immutable. What is the preferred way of doing this in Python?

I have seen the string.replace() function, but it returns a copy of the string, which does not sound very optimal since the string in this case is an entire file.

+10  A: 
l = list(str)
l[i] = 'e'
str = ''.join(l)
Can Berk Güder
Looks nice but will it work with a huge file?
theycallmemorty
@theycallmemorty: it consumes twice as much memory as it would in C, but other than that, I can't see any reason why it shouldn't work.
Can Berk Güder
In fact, if there's a lot of such manipulation being done, it's probably best to keep the strings as lists of characters.
Lars Wirzenius
This works and seems to be slightly faster than the array approach from another answer. However, both methods are a lot slower than my previous code: currently ~7 seconds vs 0.4 seconds.
Zitrax
@liw.fi: correct. The ''.join(l) line should be used after all character-based modifications are done.
Can Berk Güder
@Zitrax: what's your previous code? Python or the original language (C?). also, see my reply to liw.fi's comment.
Can Berk Güder
Wow, I'm surprised array is so much slower. A list will use a lot more memory since it creates an object per character. Does mmap work any faster? (Also, don't call your variables 'str', that's the name of the string data type!)
Nicholas Riley
@CBG: The previous code is Pike. I am not joining until done.
Zitrax
@Nicholas: sorry if I was not clear, the difference between array and list was just about 0.1 s; the big difference was versus the Pike version of this code.
Zitrax
@Zitrax: I haven't used Pike, but an order of magnitude doesn't sound realistic between two interpreted languages. Besides, Python is usually much faster than Ruby, etc.
Can Berk Güder
I'm not saying it's not possible, but there might be another bottleneck somewhere else. I use Python to parse and analyze 500 MB trace files, and it's pretty fast (~30 secs).
Can Berk Güder
Agreed - take a look at my other answer. I was able to trivially process a 5 MB file in about a second on a few-year-old laptop.
Nicholas Riley
Found the problem. I am new to Python, so I did not realize that my for loops using range() caused a lot of overhead by actually creating long lists. Using while loops instead reduced the time to about the same as the Pike script.
Zitrax
@Zitrax: you can use xrange, too.
Can Berk Güder
Oh, and I'm glad the problem is solved.
Can Berk Güder
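
The range()-vs-xrange() issue that Zitrax hit deserves a quick illustration. In Python 2, range(n) builds an n-element list on every call, while xrange(n) yields indices lazily; in Python 3 this distinction disappears because range() is already lazy. Below is a minimal sketch of the loop styles discussed in the comments above (Python 2 syntax; the data variable is just a stand-in for the file contents):

# Python 2: three ways to index over a large mutable sequence.
data = list('contents of a large file')   # stand-in for the real data

# range() materializes a full list of indices on every call -- costly
# when the sequence is millions of characters long.
for i in range(len(data)):
    data[i] = 'q'

# xrange() produces the indices lazily, avoiding the temporary list.
for i in xrange(len(data)):
    data[i] = 'q'

# A plain while loop also avoids range(), which is what Zitrax switched to.
i = 0
while i < len(data):
    data[i] = 'q'
    i += 1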
+13  A: 

Assuming you're not using a variable-length text encoding such as UTF-8, you can use array.array:

>>> import array
>>> a = array.array('c', 'foo')
>>> a[1] = 'e'
>>> a
array('c', 'feo')
>>> a.tostring()
'feo'

But since you're dealing with the contents of a file, mmap should be more efficient:

>>> f = open('foo', 'r+')
>>> import mmap
>>> m = mmap.mmap(f.fileno(), 0)
>>> m[:]
'foo\n'
>>> m[1] = 'e'
>>> m[:]
'feo\n'
>>> exit()
% cat foo
feo

Here's a quick benchmarking script (you'll need to replace dd with something else for non-Unix OSes):

import os, time, array, mmap

def modify(s):
    # Overwrite every character; works on any mutable sequence of characters.
    for i in xrange(len(s)):
        s[i] = 'q'

def measure(func):
    start = time.time()
    func(open('foo', 'r+'))
    print func.func_name, time.time() - start

def do_split(f):
    # Read into a list of one-character strings, modify, then rejoin.
    l = list(f.read())
    modify(l)
    return ''.join(l)

def do_array(f):
    # Read into a mutable character array ('c' typecode, Python 2).
    a = array.array('c', f.read())
    modify(a)
    return a.tostring()

def do_mmap(f):
    # Memory-map the file and modify it in place.
    m = mmap.mmap(f.fileno(), 0)
    modify(m)

# Create a 5 MB test file (BSD dd syntax; GNU dd on Linux expects bs=1M).
os.system('dd if=/dev/random of=foo bs=1m count=5')

measure(do_mmap)
measure(do_array)
measure(do_split)

Output I got on my several-year-old laptop matches my intuition:

5+0 records in
5+0 records out
5242880 bytes transferred in 0.710966 secs (7374304 bytes/sec)
do_mmap 1.00865888596
do_array 1.09792494774
do_split 1.20163106918

So mmap is slightly faster, but the suggested solutions don't differ much from one another. If you're seeing a huge difference, try using cProfile to see where the time is going.
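
A minimal sketch of what that could look like here (assuming Python 2 and the do_split function and foo file from the benchmark above):

import cProfile

# Run one benchmark function under the profiler; the report lists per-function
# call counts and times, which is how overhead like range() shows up.
cProfile.run("do_split(open('foo', 'r+'))")

# Whole scripts can also be profiled from the shell, e.g.:
#   python -m cProfile your_script.py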

Nicholas Riley
I seem to recall that mmap is Linux-only, so you could face portability problems.
Stefano Borini
Nope, it works on Unix and Windows (http://docs.python.org/library/mmap.html). There are some minor API differences but nothing that affects this use case. Actually a bigger difference on Windows: do_mmap 0.65700006485; do_array 1.0150001049; do_split 0.827999830246.
Nicholas Riley
Thanks for the tip about cProfile; it pointed me to the problem. The for loops used range(), which caused a lot of overhead. I switched to while loops and now the performance is good.
Zitrax
Cool! Glad you figured it out.
Nicholas Riley
+1  A: 

Try:

sl = list(s)
sl[i] = 'e'
s = ''.join(sl)
vartec
+1  A: 

Others have answered the string manipulation part of your question, but I think you ought to consider whether it would be better to parse the file into the data structure the text represents and modify that, rather than manipulating the text directly.
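
A quick sketch of what that could mean in practice. The question doesn't say what format the file has, so the whitespace-separated-fields layout below is purely hypothetical; the point is that you edit a parsed structure and only serialize back to text once:

# Hypothetical sketch: assumes one record per line with whitespace-separated
# fields. The real file format from the question is unknown.

def load(path):
    with open(path) as f:
        return [line.split() for line in f]   # list of field lists

def save(records, path):
    with open(path, 'w') as f:
        for fields in records:
            f.write(' '.join(fields) + '\n')

records = load('data.txt')        # 'data.txt' is a made-up filename
records[0][0] = 'edited'          # modify the structure, not the raw text
save(records, 'data.txt')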

Chris Upchurch