Hi all,

This is partially a theoretical question:

I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:

"Nissim" becomes "N-i-s-s-i-m-"


"01234" becomes "0a1b2c3d4e"

and so on. I suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck; this is supposed to happen all the time).

I would either use an array (pre-allocated) or try to make my own C module to handle this.

Does anyone have better ideas for this kind of thing?

(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)

Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.

Thanks

+1  A: 

Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:

s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))

So try the simplest approach first, and if it doesn't suffice then try to improve upon it; a C module should be the last option.

Edit: This is also the fastest

import timeit

s1="Nissim"
s2="------"

timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))

N=100000

print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)

output is

anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Anurag Uniyal
It's not a fair comparison at all; have a look at my suggestion on how to improve that answer, and it'll be twice as fast as yours.
SilentGhost
@SilentGhost: I have added your solution and it is the slowest
Anurag Uniyal
I've added my timings
SilentGhost
Hmmm, is it a difference between py2.5 and py3, or the style of testing?
Anurag Uniyal
I've tested your way first, and my results don't change
SilentGhost
I guess SilentGhost's solution is the fastest: anurag 0.775900713444, dweeves 0.740725330615, SilentGhost 0.599001875506. I have Python 2.6 on Windows.
van
+2  A: 

Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:

 import re
 re.sub(r'(.)', r'\1-', u'Nissim')

 count = 1
 def repl(m):
     global count
     s = m.group(1) + unicode(count)
     count += 1
     return s
 re.sub(r'(.)', repl, u'Nissim')
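
For the asker's UTF-8 case, the same idea should work if the byte string is decoded first, so that `.` matches whole characters rather than bytes; a minimal sketch (the variable names are just illustrative):

 import re

 raw = "Nissim"                # UTF-8 encoded str coming in
 text = raw.decode('utf-8')    # now '.' matches characters, not bytes
 out = re.sub(ur'(.)', ur'\1-', text).encode('utf-8')
 print out                     # N-i-s-s-i-m-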
Aaron Digulla
I haven't tested, but I think the re module will give the worst performance and is overkill here
Since you didn't test, there is a 90% chance that you're wrong :)
Aaron Digulla
still 10% is the chance :)
Anurag Uniyal
+2  A: 

This might be an effective Python solution:

s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
dweeves
Redundant whole-list creations will slow it down. Drop the [] just inside the outer .join(), and use itertools.izip instead of zip.
kaizer.se
`''.join(i+j for i, j in zip(s1, s2))` would be enough
SilentGhost
@SilentGhost: actually your solution is slower, see my answer
Anurag Uniyal
@kaizer.se - That's a suggestion that's worth profiling or timing. Bear in mind that zip is a built-in function and is therefore faster to call than izip. I don't know which is likely to be faster, but I know not to trust my instincts in this area. :-)
Jason Baker
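For what it's worth, a quick Python 2 sketch (not from the thread) for actually timing the two variants; absolute numbers will vary by machine:

import timeit
setup = "from itertools import izip; s1 = 'Nissim'; s2 = '-' * len(s1)"
print "zip ", timeit.Timer("''.join(i+j for i, j in zip(s1, s2))", setup).timeit(100000)
print "izip", timeit.Timer("''.join(i+j for i, j in izip(s1, s2))", setup).timeit(100000)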
I don't see where you tested *my* solution, Anurag
SilentGhost
@SilentGhost: sorry, I see it now; my computer hung ;)
Anurag Uniyal
A: 

Use reduce.

>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'

The same works with numbers too, as long as you know which char maps to which (a dict can be handy).

>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(10)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
ThisIsMeMoony
I'd have to see numbers to believe that this is a fast solution. The overhead of a function call in Python can actually be pretty high.
Jason Baker
@Jason: curiously enough, reduce is faster so far
SilentGhost
+1  A: 

Here are my timings. Note, it's py3.1:

>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
SilentGhost
A: 
string = "™¡™©€"
s1 = unicode(string, "utf-8")
s2 = '-' * len(s1)
''.join(sum(map(list, zip(s1, s2)), [])).encode("utf-8")
pixelbeat
+7  A: 

@gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.

Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' content. Joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
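
For instance, a minimal sketch of the cStringIO route (the helper name is just illustrative; note that cStringIO wants byte strings, so each piece is encoded before the write):

from cStringIO import StringIO

def interleave_dashes(utf8_bytes):
    buf = StringIO()
    for ch in utf8_bytes.decode('utf-8'):   # loop over characters, not bytes
        buf.write(ch.encode('utf-8'))
        buf.write('-')
    return buf.getvalue()                   # UTF-8 encoded result

print interleave_dashes("Nissim")           # N-i-s-s-i-m-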

So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).

I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character, as you so clearly specify.

There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).

So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
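
As a minimal sketch of that structure (the helper name and the dash filler are just illustrative):

def interleave(u, filler=u'-'):
    # u is a unicode object, so the loop goes over characters, not bytes
    return u''.join(ch + filler for ch in u)

raw = "Nissim"                          # UTF-8 bytes arriving from I/O
text = raw.decode('utf-8')              # unicode everywhere inside the program
out = interleave(text).encode('utf-8')  # encode only when writing back out
print out                               # N-i-s-s-i-m-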

Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).

However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).

So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
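
A minimal sketch of that fourth approach (again, the helper name is just illustrative):

import array

def interleave_array(u, filler=u'-'):
    a = array.array('u')          # array of unicode characters
    for ch in u:                  # appends are amortized O(1)
        a.append(ch)
        a.append(filler)
    return a.tounicode()

print interleave_array(u'Nissim').encode('utf-8')   # N-i-s-s-i-m-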

Alex Martelli
Alex, great response! Considering the array.array approach - it should only have one allocation total since I know how many chars I am going to have at output. So far I am going to test this one plus re and zip - the zip might give python a chance to properly treat the allocation and breakdown of the input. Will post results.
gnosis
@gnosis, making the array once (e.g. fill it w/'-'s if that's the char you need to append to every other char) then assigning to some of its items is promising. But what's the length distribution of strings of interest to you?
Alex Martelli
@alex-martelli, the length distribution is pretty much what you guessed - I'd say most of the time (like 80%) below 200 chars. Why does the filling matter? I would just allocate the memory once and then fill the data in by simple assignment. If I could allow output to be UTF-16 it could be a blast...
gnosis
@gnosis, there is no "allocating" an array without filling it with something. So you might as well fill it with the right character(s) from the start, e.g. `x = array.array('u', (2*len(s)) * u'-')` to allocate it filled with dashes, then `x[::2] = array.array('u', s)`; this should probably be the fastest approach (or other slice assignments if you're requiring something slightly different). You do need unicode internally, but a final .encode('utf-8') shouldn't slow you down too much (and neither should the initial .decode upon input).
Alex Martelli
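
Putting that last comment together into a complete, runnable Python 2 sketch (the decode/encode at the boundaries and the variable names are assumptions for illustration, not from the thread):

import array

raw = "Nissim"                              # UTF-8 bytes on input
s = raw.decode('utf-8')
x = array.array('u', u'-' * (2 * len(s)))   # one allocation, pre-filled with dashes
x[::2] = array.array('u', s)                # drop the original characters into the even slots
print x.tounicode().encode('utf-8')         # N-i-s-s-i-m-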