Hi all,

This is partially a theoretical question:

I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:

"Nissim" becomes "N-i-s-s-i-m-"


"01234" becomes "0a1b2c3d4e"

and so on. I suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck; this is supposed to happen all the time).

I would either use an array (pre-allocated) or try to make my own C module to handle this.

Does anyone have better ideas for this kind of thing?

(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)

Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.

Thanks

+1  A: 

Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:

s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))

So try the simplest approach first, and if it doesn't suffice then try to improve upon it; a C module should be the last option.

Edit: This is also the fastest

import timeit

s1="Nissim"
s2="------"

timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))

N=100000

print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)

output is

anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Anurag Uniyal
It's not a fair comparison at all; have a look at my suggestion on how to improve that answer, and it'll be twice as fast as yours.
SilentGhost
@SilentGhost: I have added your solution and it is the slowest
Anurag Uniyal
I've added my timings
SilentGhost
Hmmm, is it a difference between py2.5 and py3, or the style of testing?
Anurag Uniyal
I've tested your way first, and my results don't change
SilentGhost
I guess SilentGhost's solution is the fastest: anurag 0.775900713444, dweeves 0.740725330615, SilentGhost 0.599001875506. I have Python 2.6 on Windows.
van
+2  A: 

Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:

 import re
 re.sub(r'(.)', r'\1-', u'Nissim')

 count = 1
 def repl(m):
     global count
     s = m.group(1) + unicode(count)
     count += 1
     return s
 re.sub(r'(.)', repl, u'Nissim')
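
For the asker's UTF-8 case, the same idea should work if the byte string is decoded first, so that `.` matches whole characters rather than bytes; a minimal sketch (the variable names are just illustrative):

 import re

 raw = "Nissim"                # UTF-8 encoded str coming in
 text = raw.decode('utf-8')    # now '.' matches characters, not bytes
 out = re.sub(ur'(.)', ur'\1-', text).encode('utf-8')
 print out                     # N-i-s-s-i-m-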
Aaron Digulla
I haven't tested, but I think the re module will give the worst performance and is overkill here
Since you didn't test, there is a 90% chance that you're wrong :)
Aaron Digulla
still 10% is the chance :)
Anurag Uniyal
+2  A: 

This might be an effective Python solution:

s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
dweeves
Redundant whole-list creations will slow it down. Drop the [] just inside the outer .join(), and use itertools.izip instead of zip.
kaizer.se
`''.join(i+j for i, j in zip(s1, s2))` would be enough
SilentGhost
@SilentGhost: actually your solution is slower, see my answer
Anurag Uniyal
@kaizer.se - That's a suggestion that's worth profiling or timing. Bear in mind that zip is a built-in function and is therefore faster to call than izip. I don't know which is likely to be faster, but I know not to trust my instincts in this area. :-)
Jason Baker
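For what it's worth, a quick Python 2 sketch (not from the thread) for actually timing the two variants; absolute numbers will vary by machine:

import timeit
setup = "from itertools import izip; s1 = 'Nissim'; s2 = '-' * len(s1)"
print "zip ", timeit.Timer("''.join(i+j for i, j in zip(s1, s2))", setup).timeit(100000)
print "izip", timeit.Timer("''.join(i+j for i, j in izip(s1, s2))", setup).timeit(100000)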
I don't see where you tested *my* solution, Anurag
SilentGhost
@SilentGhost: sorry, I see it now; my computer hung ;)
Anurag Uniyal
A: 

Use reduce.

>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'

The same works with numbers too, as long as you know which char maps to which (a dict can be handy).

>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(10)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
ThisIsMeMoony
I'd have to see numbers to believe that this is a fast solution. The overhead of a function call in Python can actually be pretty high.
Jason Baker
@Jason: curiously enough, reduce is faster so far
SilentGhost
+1  A: 

Here are my timings. Note, it's py3.1:

>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
SilentGhost
A: 
string = "™¡™©€"
s1 = unicode(string, "utf-8")
s2 = '-' * len(s1)
''.join(sum(map(list, zip(s1, s2)), [])).encode("utf-8")
pixelbeat
+7  A: 

@gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.

Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' content. Joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
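
For instance, a minimal sketch of the cStringIO route (the helper name is just illustrative; note that cStringIO wants byte strings, so each piece is encoded before the write):

from cStringIO import StringIO

def interleave_dashes(utf8_bytes):
    buf = StringIO()
    for ch in utf8_bytes.decode('utf-8'):   # loop over characters, not bytes
        buf.write(ch.encode('utf-8'))
        buf.write('-')
    return buf.getvalue()                   # UTF-8 encoded result

print interleave_dashes("Nissim")           # N-i-s-s-i-m-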

So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).

I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character, as you so clearly specify.

There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).

So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
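
As a minimal sketch of that structure (the helper name and the dash filler are just illustrative):

def interleave(u, filler=u'-'):
    # u is a unicode object, so the loop goes over characters, not bytes
    return u''.join(ch + filler for ch in u)

raw = "Nissim"                          # UTF-8 bytes arriving from I/O
text = raw.decode('utf-8')              # unicode everywhere inside the program
out = interleave(text).encode('utf-8')  # encode only when writing back out
print out                               # N-i-s-s-i-m-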

Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).

However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).

So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
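
A minimal sketch of that fourth approach (again, the helper name is just illustrative):

import array

def interleave_array(u, filler=u'-'):
    a = array.array('u')          # array of unicode characters
    for ch in u:                  # appends are amortized O(1)
        a.append(ch)
        a.append(filler)
    return a.tounicode()

print interleave_array(u'Nissim').encode('utf-8')   # N-i-s-s-i-m-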

Alex Martelli
Alex, great response! Considering the array.array approach - it should only have one allocation total since I know how many chars I am going to have at output. So far I am going to test this one plus re and zip - the zip might give python a chance to properly treat the allocation and breakdown of the input. Will post results.
gnosis
@gnosis, making the array once (e.g. fill it w/'-'s if that's the char you need to append to every other char) then assigning to some of its items is promising. But what's the length distribution of strings of interest to you?
Alex Martelli
@alex-martelli, the length distribution is pretty much what you guessed - I'd say most of the time (like 80%) below 200 chars. Why does the filling matter? I would just allocate the memory once and then fill the data in by simple assignment. If I could allow output to be UTF-16 it could be a blast...
gnosis
@gnosis, there is no "allocating" an array without filling it with something. So you might as well fill it with the right character(s) from the start, e.g. `x = array.array('u', (2*len(s)) * u'-')` to allocate it filled with dashes, then `x[::2] = array.array('u', s)`; this should probably be the fastest approach (or other slice assignments if you're requiring something slightly different). You do need unicode internally, but a final .encode('utf-8') shouldn't slow you down too much (and neither should the initial .decode upon input).
Alex Martelli
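
Putting that last comment together into a complete, runnable Python 2 sketch (the decode/encode at the boundaries and the variable names are assumptions for illustration, not from the thread):

import array

raw = "Nissim"                              # UTF-8 bytes on input
s = raw.decode('utf-8')
x = array.array('u', u'-' * (2 * len(s)))   # one allocation, pre-filled with dashes
x[::2] = array.array('u', s)                # drop the original characters into the even slots
print x.tounicode().encode('utf-8')         # N-i-s-s-i-m-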