ansaurus

Question

How efficient is Python substring extraction?

Answer 1

+3 A:

At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd, since with immutable strings, sharing the underlying buffer should be a relatively easy thing to do.

Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
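For readers on Python 3, the buffer protocol is reachable through `memoryview`. The following is a minimal sketch, assuming Python 3 and byte data rather than the 2.x `str` discussed in this thread; the variable names are illustrative only. It contrasts a `bytes` slice, which copies, with a `memoryview` slice, which refers back to the original object.

```python
data = b"\xef\xbb\xbf<html>hello</html>"  # UTF-8 BOM followed by content

# Slicing bytes allocates a new bytes object and copies the data.
copied = data[3:]

# Slicing a memoryview creates a view that references the original buffer.
view = memoryview(data)[3:]

print(view.obj is data)       # the view still refers to the original object
print(bytes(view) == copied)  # same content; bytes() copies only on demand
```

A view is only useful if the consumer accepts buffer objects; anything that needs a real string still has to copy at some point.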

Glenn Maynard 2010-03-16 19:30:03

Thanks for the info. I'm actually using Python 2.5 (I've updated my question) but I doubt it's done differently. I'll just have to live with the duplication, I guess (I _really_ need to remove that one character).

Cameron 2010-03-16 19:37:38

Can't you just read the first character out of the file, and not assign it to the string to begin with? See my answer, coming momentarily. **edit:** see benson's answer instead.

jcdyer 2010-03-16 20:53:56

Answer 2

+2 A:

As with most garbage-collected languages, strings are created as often as needed, which is very often. The reason is that tracking substrings as described would make garbage collection more difficult.

What is the actual algorithm you are trying to implement? It might be possible to give you advice for ways to get better results if we knew a bit more about it.

As for an alternative, what is it you really need to do? Could you use a different way of looking at the issue, such as just keeping an integer index into the string? Could you use an array.array('u')?
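The integer-index idea might look like the sketch below, assuming Python 3; `count_fields` is a hypothetical helper invented for illustration. It relies on the fact that `str.find` accepts a start offset, so the remainder of the string never has to be sliced off.

```python
def count_fields(line, sep=","):
    """Count separator-delimited fields by moving an index, not by slicing."""
    count = 1
    pos = line.find(sep)
    while pos != -1:
        count += 1
        # Resume the search at an offset; no substring is created.
        pos = line.find(sep, pos + 1)
    return count


print(count_fields("a,b,c"))
```

Other `str` methods such as `startswith` and `index` take the same kind of offset argument, so the pattern carries over to them.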

TokenMacGuy 2010-03-16 19:35:31

I'm removing the BOM from a UTF-8 decoded file in memory, then sending the contents of this file into a templating engine (Jinja2), then writing the result to an HTML response. I just figured out a way that I'll only have to do this once per template file, though, so it's not really an issue anymore :-)
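A minimal sketch of that one-time BOM removal on already-decoded text, assuming Python 3; `strip_bom` and the sample template string are hypothetical and not taken from the poster's code.

```python
BOM = "\ufeff"  # U+FEFF as it appears in decoded text


def strip_bom(text):
    """Return text without a leading BOM (hypothetical helper)."""
    # The slice copies the string once; without a BOM, nothing is copied.
    return text[len(BOM):] if text.startswith(BOM) else text


template_source = strip_bom("\ufeff<h1>{{ title }}</h1>")
print(template_source)
```

Done once per template file, the single copy is unlikely to matter in practice.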

Cameron 2010-03-16 19:39:25

Answer 3

+1 A:

One (albeit slightly hacky) solution would be something like this:

f = open("test.c")
f.read(1)          # read and discard the first byte
myStr = f.read()   # read the rest of the file
print myStr

It will skip the first character, and then read the data into your string variable.

Benson 2010-03-16 19:50:39

Actually, that will read the first byte, not necessarily the first character. In a UTF-8 encoded file, only the 128 US-ASCII characters are encoded in a single byte.
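The byte counts can be checked directly. This is a small sketch, assuming Python 3 string semantics; the variable names are illustrative.

```python
import codecs

ascii_len = len("A".encode("utf-8"))          # an ASCII character
accented_len = len("\u00e9".encode("utf-8"))  # e with acute accent
bom_len = len("\ufeff".encode("utf-8"))       # the BOM character U+FEFF

print(ascii_len, accented_len, bom_len)
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)
```

Since the BOM occupies more than one byte in UTF-8, discarding a single byte would leave part of it behind.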

tgray 2010-03-16 20:24:59

So read the first line, convert to unicode, and then strip the first character. Proceed more or less as above, converting to unicode as you go along. If you don't convert, then you're dealing with bytes.
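One way to read characters rather than bytes is sketched below, assuming Python 3, where `open` accepts an `encoding` argument. The temporary file and its contents are sample data made up for the example.

```python
import os
import tempfile

# Write a small UTF-8 file that begins with a BOM (sample data).
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf<p>caf\xc3\xa9</p>")

# In text mode, read(1) returns one decoded character, not one byte.
with open(path, encoding="utf-8") as f:
    first = f.read(1)
    rest = f.read()

# Keep the first character unless it was the BOM.
content = rest if first == "\ufeff" else first + rest
os.remove(path)
print(content)
```

Opening the file with `encoding="utf-8-sig"` instead would drop a leading BOM automatically, which avoids the manual check.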

jcdyer 2010-03-16 20:56:59

I would use this technique, but at the time I'm reading it from file I don't know whether the BOM should be kept or not. When I later retrieve the contents (from a DB), I get the entire file back at once. A version of your technique has actually already been presented to me in the answer to another (related) question I asked earlier: http://stackoverflow.com/questions/2456380/utf-8-html-and-css-files-with-bom-and-how-to-remove-the-bom-with-python/2456524#2456524

Cameron 2010-03-16 22:52:18

Always use a context manager when dealing with files, i.e. `with open("test.c") as f:`

Mike Graham 2010-03-16 23:22:37

@Mike: I would have, but he said he was using 2.5, and I didn't want to muck about with the `from __future__ import with_statement` junk.

Benson 2010-03-18 21:20:42

@Benson, that isn't junk; that is making your code right. If he was using pre-2.5 Python, I would have said `close` needs to go inside a `finally` block.
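The two forms being discussed can be put side by side. This is a sketch assuming Python 3 syntax (on 2.5 the `with` form additionally needs the `__future__` import mentioned above); the temporary file is sample data for the example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Context-manager form: the file is closed when the block exits, even on error.
with open(path, "w") as f:
    f.write("hello")

# Equivalent try/finally form for interpreters without the with statement.
g = open(path)
try:
    data = g.read()
finally:
    g.close()

os.remove(path)
print(data)
```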

Mike Graham 2010-03-18 21:32:50

You might not be reading from a file in the first place; there are many (though infrequent) cases where being able to make string slices for "free" makes it easier to get an efficient algorithm.

Glenn Maynard 2010-03-19 01:10:54

Answer 4

+1 A:

Depending on what you are doing, itertools.islice may be a suitable memory-efficient solution (should one become necessary).
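A rough sketch of how `itertools.islice` might be applied here, assuming Python 3; the sample string is invented for the example.

```python
from itertools import islice

text = "\ufeffhello world"

# Iterate over characters 1..5 lazily; no intermediate substring is created.
letters = [ch.upper() for ch in islice(text, 1, 6)]

# islice also accepts an open-ended stop.
tail_length = sum(1 for _ in islice(text, 1, None))

print(letters, tail_length)
```

The result is an iterator rather than a string, so this fits when the characters are consumed one at a time. `islice` still steps over the skipped items internally, so it saves memory rather than time.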

Mike Graham 2010-03-16 23:21:58

Cool, I didn't know that module even existed!

Cameron 2010-03-17 00:40:46

Good find, then! `itertools` is constantly useful.

Mike Graham 2010-03-17 00:53:44
