To save space and avoid the complexity of keeping data consistent between different sources, I'm considering storing start/end indices for some substrings instead of storing the substrings themselves. The catch is that if I do so, I may end up creating slices ALL the time. Is this something to be avoided? Is the slice operator fast enough that I don't need to worry? What about the overhead of creating and destroying all those new objects?
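
For concreteness, a rough sketch of what I mean (made-up data, just to illustrate the idea):

    # One copy of the full text, plus (start, end) pairs for each sentence,
    # instead of storing every sentence as its own string.
    text = "First sentence. Second sentence. Third sentence."
    sentence_spans = [(0, 15), (16, 32), (33, 48)]

    def get_sentence(n):
        # Each call builds a brand-new string object by slicing.
        start, end = sentence_spans[n]
        return text[start:end]

    print(get_sentence(1))  # -> "Second sentence."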


Okay, I learned my lesson. Don't optimize unless there's a real problem you're trying to fix. (Of course this doesn't mean to write needlessly bad code, but that's beside the point...) Also, test and profile before coming to Stack Overflow. =D Thanks everyone!

+7  A: 
  1. Fast enough as opposed to what? How do you do it right now? What exactly are you storing, what exactly are you retrieving? The answer probably depends heavily on this. Which brings us to ...

  2. Measure! Don't discuss and analyze theoretically; try it and measure which way performs better. Then decide whether the possible performance gain justifies refactoring your database.

Edit: I just ran a test measuring string slicing versus lookup in a dict keyed on (start, end) tuples. It suggests that there's not much of a difference. It's a pretty naive test, though, so take it with a pinch of salt.
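
Roughly what my test looked like (a reconstructed sketch, not the exact script):

    import timeit

    text = "The quick brown fox jumps over the lazy dog. " * 1000
    spans = [(i, i + 45) for i in range(0, len(text) - 45, 45)]
    # The alternative: substrings precomputed and keyed on their (start, end) tuple.
    lookup = dict(((start, end), text[start:end]) for start, end in spans)

    setup = "from __main__ import text, spans, lookup"
    slice_stmt = "for start, end in spans: text[start:end]"
    dict_stmt = "for start, end in spans: lookup[(start, end)]"

    print("slicing:    ", timeit.timeit(slice_stmt, setup=setup, number=1000))
    print("dict lookup:", timeit.timeit(dict_stmt, setup=setup, number=1000))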

balpha
The current method is storing the text, its sentences, and the tokens in each sentence all as separate strings (in the database), and linking each one to the others. Seems like a lot of unnecessary bloat to me; a 2 MB text ends up taking up 28 MB of database. Anyway, what would be retrieved all the time is individual sentences from the text. The alternative is slicing the text based on the stored indices. But you've got a really good point. Measuring is probably the best way to go. =P
tehgeekmeister
Don't underestimate the decision part as well: if there's a performance / storage space tradeoff (and most of the time there is), you have to take into account what resources you have. 28 MB is not a lot if you really need the CPU time but have a terabyte hard drive at your disposal. 28 MB *is* a lot if you're running a small embedded system that only gets accessed once a day. Well, I guess the whole thing I just wrote boils down to "It always depends" :-)
balpha
@tehgeekmeister: Please update your question with these additional facts.
S.Lott
+2  A: 

Premature optimization is the root of all evil.

Prove to yourself that you really have a need to optimize code, then act.

iElectric
+1  A: 

I haven't done any measurements either, but since it sounds like you're already taking a C approach to a problem in Python, you might want to take a look at Python's built-in mmap library:

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
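
A minimal sketch of what that looks like in practice (the filename is made up; assume data.txt already exists and is at least a few bytes long):

    import mmap

    with open("data.txt", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)   # map the whole file
        first_bytes = mm[0:20]          # slice it like a string, without reading the whole file in
        mm[0:5] = b"HELLO"              # unlike a normal string, the mapping is mutable
                                        # (note: this really writes to the file!)
        mm.close()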

I'm not sure from your question if that's exactly what you're looking for. And it bears repeating that you need to take some measurements. Python's timeit module is the easiest one to use, but there are also cProfile and hotshot (although, as I understand it, hotshot is at risk of being removed from the standard library).

Mark Rushakoff
+1  A: 

In a comment the OP mentions bloat "in the database" -- but no information regarding what database he's talking about; from the scant information in that comment it would seem that Python string slices aren't necessarily what's involved; rather, the "slicing" would be done by the DB engine upon retrieval.

If that's the actual situation then I would recommend, on general principles, against storing redundant information in the DB -- a "normal form" (maybe in a lax sense of the expression;-) whereby information is stored just once and derived information is recomputed (or cached, left in charge of the DB engine, etc;-) should be the norm, and "denormalization" by deliberately storing derived information very much the exception, and only when justified by specific, well-measured retrieval-performance needs.

If the reference to "database" was a misdirection;-), or rather used in a lax sense as I did for "normal form" above;-), then another consideration may apply: since Python strings are immutable, it would seem natural to not have to do slices by copying, but rather have each slice reuse part of the memory space of the parent it's being sliced from (much as is done for numpy arrays' slices). However, that's not currently part of the Python core. I did once try a patch to that purpose, but the problem of adding a reference to the big string, and thus making it stay in memory just because a tiny substring thereof is still referenced, loomed large for general-purpose adoption.

Still, it would be possible to make a special-purpose subclass of string (and one of unicode) for the case in which the big "parent" string needs to stay in memory anyway. Currently buffer does a tiny bit of that, but you can't call string methods on a buffer object (without explicitly copying it to a string object first), so it's only really useful for output and a few special cases... but there's no real conceptual block against adding string methods (I doubt that would be adopted in the core, but it should be decently easy to maintain as a third-party module anyway;-).
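
(Purely to illustrate the general idea -- this uses memoryview over bytes, not the patch or the string subclass discussed above: you get a copy-free slice, but, as with buffer, no string methods on the view.)

    big = b"x" * (10 * 1024 * 1024)      # a 10 MB bytes object

    copy = big[100:200]                  # ordinary slice: copies those 100 bytes
    view = memoryview(big)[100:200]      # view: shares big's memory, nothing is copied

    print(len(copy), len(view))          # both 100
    print(copy == view.tobytes())        # True -- same data
    # but view.find(...) / view.split(...) don't exist; convert back to bytes first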

The worth of such an approach can hardly be solidly proven by measurement, one way or another -- speed would be very similar to the current implicitly-copying approach; the advantage would come entirely in terms of reducing memory footprint, which wouldn't so much make any given Python code faster, but rather allow a certain program to execute on a machine with a bit less RAM, or multi-task better when several instances are being used at the same time in separate processes. See rope for a similar but richer approach once experimented with in the context of C++ (but note it didn't make it into the standard;-).

Alex Martelli
+1  A: 

Would slices be inefficient because they create copies of the source string? This may or may not be an issue. If it turns out to be one, would it not be possible to simply implement a "string view": an object that holds a reference to the source string plus a start and end point? Upon access/iteration, it just reads from the source string.
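
A minimal sketch of such a view object (names made up, just to illustrate):

    class StringView(object):
        """Keeps a reference to the source string plus start/end indices;
        the characters are only materialized when actually asked for."""

        def __init__(self, source, start, end):
            self.source = source
            self.start = start
            self.end = end

        def __len__(self):
            return self.end - self.start

        def __iter__(self):
            for i in range(self.start, self.end):
                yield self.source[i]

        def __str__(self):
            # the copy happens only here, when the text is really needed
            return self.source[self.start:self.end]

    text = "The quick brown fox jumps over the lazy dog."
    view = StringView(text, 4, 9)
    print(str(view))    # -> "quick"
    print(len(view))    # -> 5

Note that it has the drawback Alex mentions above: the view keeps the whole parent string alive for as long as the view itself is referenced.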

kaizer.se
This was my worry. But I think everyone was right: I don't need to optimize yet, and if I did I should've measured and tested before I came here.
tehgeekmeister