I've got the entire contents of a text file (at least a few KB) in string myStr.

Will the following code create a copy of the string (less the first character) in memory?

myStr = myStr[1:]

I'm hoping it just refers to a different location in the same internal buffer. If not, is there a more efficient way to do this?

Thanks!

Note: I'm using Python 2.5.

+3  A: 

At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd, since with immutable strings it should be a relatively easy thing to do.

Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
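For illustration only, here is a minimal sketch of the difference between a slice and a buffer-style view. It uses `memoryview` on a byte string, which is an assumption on my part: `memoryview` is available from Python 2.7 and 3.x, while on 2.5 the corresponding built-in is `buffer`. The sample data is made up.

```python
# Made-up sample: a UTF-8 BOM followed by some payload bytes.
data = b"\xef\xbb\xbfhello world"

# A plain slice allocates a new bytes object holding a copy of the tail.
sliced = data[3:]

# A memoryview slice refers to the original buffer; no bytes are copied
# until something asks for them (for example view.tobytes()).
view = memoryview(data)[3:]

same_source = view.obj is data      # the view still points at `data`
same_content = view.tobytes() == sliced
```

Note that many APIs expect a real string rather than a view, which is why this is rarely worth the trouble for a few KB of text.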

Glenn Maynard
Thanks for the info. I'm actually using Python 2.5 (I've updated my question) but I doubt it's done differently. I'll just have to live with the duplication, I guess (I _really_ need to remove that one character).
Cameron
Can't you just read the first character out of the file, and not assign it to the string to begin with? See my answer, coming momentarily. **edit:** see benson's answer instead.
jcdyer
+2  A: 

As with most garbage-collected languages, strings are created as often as needed, which is very often. The reason is that tracking substrings as described would make garbage collection more difficult.

What is the actual algorithm you are trying to implement? It might be possible to suggest ways to get better results if we knew a bit more about it.

As for an alternative, what is it you really need to do? Could you use a different way of looking at the issue, such as just keeping an integer index into the string? Could you use an array.array('u')?
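A rough sketch of the integer-index idea, assuming the goal is to ignore a leading BOM character: keep the string as it is and pass a start offset to the methods that accept one. The sample text and variable names are hypothetical.

```python
# Made-up sample: decoded text that begins with the BOM character U+FEFF.
text = u"\ufeff<html>body</html>"

# Remember where the real content starts instead of slicing the string.
start = 1 if text.startswith(u"\ufeff") else 0

# Several str methods take an optional start position, so no copy is made.
begins_with_tag = text.startswith(u"<html>", start)
body_pos = text.find(u"body", start)   # index is relative to the full string
```

This only helps while the code consuming the string accepts an offset; anything that needs the whole string without the first character still requires a copy.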

TokenMacGuy
I'm removing the BOM from a UTF-8 decoded file in memory, then sending the contents of this file into a templating engine (Jinja2), then writing the result to an HTML response. I just figured out a way that I'll only have to do this once per template file, though, so it's not really an issue anymore :-)
Cameron
+1  A: 

One (albeit slightly hacky) solution would be something like this:

f = open("test.c")
f.read(1)
myStr = f.read()
print myStr

It will skip the first character, and then read the data into your string variable.

Benson
Actually, that will read the first byte, not necessarily the first character. In a UTF-8 encoded file, only the 128 US-ASCII characters are encoded in one byte.
tgray
So read the first line, convert to unicode, and then strip the first character. Proceed more or less as above, converting to unicode as you go along. If you don't convert, then you're dealing with bytes.
jcdyer
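To make the byte-versus-character point concrete, here is a small sketch with made-up data. It assumes the content is available as raw UTF-8 bytes; the `utf-8-sig` codec (in the standard library since Python 2.5) drops a leading BOM while decoding.

```python
# Made-up sample: the three-byte UTF-8 BOM followed by some markup.
raw = b"\xef\xbb\xbf<p>hi</p>"

# Dropping one *byte* (what f.read(1) skips) leaves two BOM bytes behind.
leftover = raw[1:]

# Decoding first turns the BOM into a single character, U+FEFF ...
text = raw.decode("utf-8")

# ... and the utf-8-sig codec removes it during decoding.
stripped = raw.decode("utf-8-sig")
```

If the BOM has to be kept in some cases, decoding with plain `utf-8` and checking for a leading `u"\ufeff"` afterwards keeps that decision open.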
I would use this technique, but at the time I'm reading it from file I don't know whether the BOM should be kept or not. When I later retrieve the contents (from a DB), I get the entire file back at once. A version of your technique has actually already been presented to me in the answer to another (related) question I asked earlier: http://stackoverflow.com/questions/2456380/utf-8-html-and-css-files-with-bom-and-how-to-remove-the-bom-with-python/2456524#2456524
Cameron
Always use a context manager when dealing with files, i.e. `with open("test.c") as f:`
Mike Graham
@Mike: I would have, but he said he was using 2.5, and I didn't want to muck about with the from __future__ import with_statement junk.
Benson
@Benson, that isn't junk; that is making your code right. If he was using pre-2.5 Python, I would have said `close` needs to go inside a `finally` block.
Mike Graham
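As a sketch of the `finally` variant mentioned above, the file is closed even if reading raises. The temporary file merely stands in for "test.c" and its contents are invented; on 2.5 the `with` form needs the `from __future__ import with_statement` line at the top of the module.

```python
import os
import tempfile

# Hypothetical stand-in for "test.c": one junk byte, then the real content.
fd, path = tempfile.mkstemp(suffix=".c")
os.write(fd, b"xint main(void) { return 0; }\n")
os.close(fd)

f = open(path, "rb")
try:
    f.read(1)          # skip the first byte
    rest = f.read()    # read everything after it
finally:
    f.close()          # runs whether or not read() raised

os.remove(path)        # clean up the temporary file
```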
You might not be reading from a file in the first place; there are many (though infrequent) cases where being able to make string slices for "free" makes it easier to get an efficient algorithm.
Glenn Maynard
+1  A: 

Depending on what you are doing, itertools.islice may be a suitable memory-efficient solution (should one become necessary).
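A minimal sketch of what that could look like, assuming the consumer can work with an iterator of characters rather than a string. The sample text is made up.

```python
from itertools import islice

# Made-up sample: BOM character followed by the real content.
text = u"\ufeffabc"

# islice yields the characters from index 1 onward lazily; the original
# string is not copied, although the skipped item is still stepped over.
rest = islice(text, 1, None)

chars = list(rest)   # materialized here only to show the result
```

Joining the result back into a string would of course create a new string, so this pays off only when the characters are consumed one at a time.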

Mike Graham
Cool, I didn't know that module even existed!
Cameron
Good find, then! `itertools` is constantly useful.
Mike Graham