views:

142

answers:

4

I was trying to figure out which integers python only instantiates once (-6 to 256 it seems), and in the process stumbled on some string behaviour I can't see the pattern in. Sometimes, equal strings created in different ways share the same id, sometimes not. This code:

A = "10000"
B = "10000"
C = "100" + "00"
D = "%i"%10000
E = str(10000)
F = str(10000)
G = str(100) + "00"
H = "0".join(("10","00"))

for obj in (A,B,C,D,E,F,G,H):
    print obj, id(obj), obj is A

prints:

10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959456 False
10000 4959488 False
10000 4959520 False
10000 4959680 False

I don't even see the pattern - save for the fact that the first four don't have an explicit function call - but surely that can't be it, since the "+" in C for example implies a function call to add. I especially don't understand why C and G are different, seeing as that implies that the ids of the components of the addition are more important than the outcome.

So, what is the special treatment that A-D undergo, making them come out as the same instance?

+1  A: 

I believe short strings that can be evaluated at compile time, will be interned automatically. In the last examples, the result can't be evaluated at compile time because str or join might be redefined.

Greg Hewgill
+4  A: 

Python is allowed to inline string constants; A,B,C,D are actually the same literals (if Python sees a constant expression, it treats it as a constant).

str is actually a class, so str(whatever) is calling this class' constructor, which should yield a fresh object. This explains E,F,G (note that each of these has separate identity).

As for H, I am not sure, but I'd go for explanation that this expression is too complicated for Python to figure out it's actually a constant, so it computes a new string.

Maciej Pasternacki
Calling the constructor of an immutable type is fully allowed to NOT "yield a fresh object" -- `int(23) is int(23)` is True, while same with `str` is False -- in most current implementations at least; see my long answer for more details on such optimization heuristics (and _DON'T_ rely on them as they may change in the next point release!-).
Alex Martelli
str(aString) just seems to pass the string through as well, rather than creating a new one.
Markus
Alex, you're right, I didn't get the difference between mutable and immutable objects right here. Thanks for the correction.
Maciej Pasternacki
+9  A: 

In terms of language specification, any compliant Python compiler and runtime is fully allowed, for any instance of an immutable type, to make a new instance OR find an existing instance of the same type that's equal to the required value and use a new reference to that same instance. This means it's always incorrect to use is or by-id comparison among immutables, and any minor release may tweak or change strategy in this matter to enhance optimization.

In terms of implementations, the tradeoff are pretty clear: trying to reuse an existing instance may mean time spent (perhaps wasted) trying to find such an instance, but if the attempt succeeds then some memory is saved (as well as the time to allocate and later free the memory bits needed to hold a new instance).

How to solve those implementation tradeoffs is not entirely obvious -- if you can identify heuristics that indicate that finding a suitable existing instance is likely and the search (even if it fails) will be fast, then you may want to attempt the search-and-reuse when the heuristics suggest it, but skip it otherwise.

In your observations you seem to have found a particular dot-release implementation that performs a modicum of peephole optimization when that's entirely safe, fast, and simple, so the assignments A to D all boil down to exactly the same as A (but E to F don't, as they involve named functions or methods that the optimizer's authors may reasonably have considered not 100% safe to assume semantics for -- and low-ROI if that was done -- so they're not peephole-optimized).

Thus, A to D reusing the same instance boils down to A and B doing so (as C and D get peephole-optimized to exactly the same construct).

That reuse, in turn, clearly suggests compiler tactics/optimizer heuristics whereby identical literal constants of an immutable type in the same function's local namespace are collapsed to references to just one instance in the function's .func_code.co_consts (to use current CPython's terminology for attributes of functions and code objects) -- reasonable tactics and heuristics, as reuse of the same immutable constant literal within one function are somewhat frequent, AND the price is only paid once (at compile time) while the advantage is accrued many times (every time the function runs, maybe within loops etc etc).

(It so happens that these specific tactics and heuristics, given their clearly-positive tradeoffs, have been pervasive in all recent versions of CPython, and, I believe, IronPython, Jython, and PyPy as well;-).

This is a somewhat worthy and interesting are of study if you're planning to write compilers, runtime environments, peephole optimizers, etc etc, for Python itself or similar languages. I guess that deep study of the internals (ideally of many different correct implementations, of course, so as not to fixate on the quirks of a specific one -- good thing Python currently enjoys at least 4 separate production-worthy implementations, not to mention several versions of each!) can also help, indirectly, make one a better Python programmer -- but it's particularly important to focus on what's guaranteed by the language itself, which is somewhat less than what you'll find in common among separate implementations, because the parts that "just happen" to be in common right now (without being required to be so by the language specs) may perfectly well change under you at the next point release of one or another implementation and, if your production code was mistakenly relying on such details, that might cause nasty surprises;-). Plus -- it's hardly ever necessary, or even particularly helpful, to rely on such variable implementation details rather than on language-mandated behavior (unless you're coding something like an optimizer, debugger, profiler, or the like, of course;-).

Alex Martelli
Tangental but related: In Lua, strings are not only immutable, but they're unique: every string with the same value is the same object. This sounds slow, but Lua is known as a very fast scripting language; it does this with an extremely well-optimized hash table. It has the great property that *all* string equality comparisons are O(1). String equality turns into a pointer comparison.
Glenn Maynard
Definitely an interesting language-design choice, thanks for mentioning it.
Alex Martelli
+1  A: 

in answer to S.Lott's suggestion of examining the byte code:

import dis
def moo():
    A = "10000"
    B = "10000"
    C = "100" + "00"
    D = "%i"%10000
    E = str(10000)
    F = str(10000)
    G = "1000"+str(0)
    H = "0".join(("10","00"))
    I = str("10000")

    for obj in (A,B,C,D,E,F,G,H, I):
        print obj, id(obj), obj is A
moo()
print dis.dis(moo)

yields:

10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 2840928 False
10000 2840896 False
10000 2840864 False
10000 2840832 False
10000 4968128 True
  4           0 LOAD_CONST               1 ('10000')
              3 STORE_FAST               0 (A)

  5           6 LOAD_CONST               1 ('10000')
              9 STORE_FAST               1 (B)

  6          12 LOAD_CONST              10 ('10000')
             15 STORE_FAST               2 (C)

  7          18 LOAD_CONST              11 ('10000')
             21 STORE_FAST               3 (D)

  8          24 LOAD_GLOBAL              0 (str)
             27 LOAD_CONST               5 (10000)
             30 CALL_FUNCTION            1
             33 STORE_FAST               4 (E)

  9          36 LOAD_GLOBAL              0 (str)
             39 LOAD_CONST               5 (10000)
             42 CALL_FUNCTION            1
             45 STORE_FAST               5 (F)

 10          48 LOAD_CONST               6 ('1000')
             51 LOAD_GLOBAL              0 (str)
             54 LOAD_CONST               7 (0)
             57 CALL_FUNCTION            1
             60 BINARY_ADD          
             61 STORE_FAST               6 (G)

 11          64 LOAD_CONST               8 ('0')
             67 LOAD_ATTR                1 (join)
             70 LOAD_CONST              12 (('10', '00'))
             73 CALL_FUNCTION            1
             76 STORE_FAST               7 (H)

 12          79 LOAD_GLOBAL              0 (str)
             82 LOAD_CONST               1 ('10000')
             85 CALL_FUNCTION            1
             88 STORE_FAST               8 (I)

 14          91 SETUP_LOOP              66 (to 160)
             94 LOAD_FAST                0 (A)
             97 LOAD_FAST                1 (B)
            100 LOAD_FAST                2 (C)
            103 LOAD_FAST                3 (D)
            106 LOAD_FAST                4 (E)
            109 LOAD_FAST                5 (F)
            112 LOAD_FAST                6 (G)
            115 LOAD_FAST                7 (H)
            118 LOAD_FAST                8 (I)
            121 BUILD_TUPLE              9
            124 GET_ITER            
        >>  125 FOR_ITER                31 (to 159)
            128 STORE_FAST               9 (obj)

 15         131 LOAD_FAST                9 (obj)
            134 PRINT_ITEM          
            135 LOAD_GLOBAL              2 (id)
            138 LOAD_FAST                9 (obj)
            141 CALL_FUNCTION            1
            144 PRINT_ITEM          
            145 LOAD_FAST                9 (obj)
            148 LOAD_FAST                0 (A)
            151 COMPARE_OP               8 (is)
            154 PRINT_ITEM          
            155 PRINT_NEWLINE       
            156 JUMP_ABSOLUTE          125
        >>  159 POP_BLOCK           
        >>  160 LOAD_CONST               0 (None)
            163 RETURN_VALUE

so it would seem that indeed the compiler understands A-D to mean the same thing, and so it saves memory by only generating it once (as suggested by Alex,Maciej and Greg). (added case I seems to just be str() realising it's trying to make a string from a string, and just passing it through.)

Thanks everyone, that's a lot clearer now.

Markus