views: 375

answers: 4

(Assume that: application start-up time is absolutely critical; my application is started a lot; my application runs in an environment in which importing is slower than usual; many files need to be imported; and compilation to .pyc files is not available.)

I would like to concatenate all the python source files that define a collection of modules into a single new python source file.

I would like the result of importing the new file to be as if I imported one of the original files (which would then import some more of the original files, and so on).

Is this possible?

Here is a rough, manual simulation of what a tool might produce when fed the source files for modules 'bar' and 'baz'. You would run such a tool prior to deploying the code.

__file__ = 'foo.py'

import sys
import types

def _module(_name):
    # Create an empty module object and register it in sys.modules,
    # so that later imports of that name find it already loaded.
    mod = types.ModuleType(_name)
    mod.__file__ = __file__
    sys.modules[_name] = mod
    return mod

def _bar_module():

    def hello():
        print 'Hello World! BAR'

    mod = _module('foo.bar')
    mod.hello = hello
    return mod

bar = _bar_module()
del _bar_module

def _baz_module():

    def hello():
        print 'Hello World! BAZ'

    mod = _module('foo.bar.baz')
    mod.hello = hello
    return mod

# Attach baz to bar as well, so that foo.bar.baz resolves as an attribute.
baz = bar.baz = _baz_module()
del _baz_module

And now you can:

from foo.bar import hello
hello()

This code doesn't take into account things like import statements and dependencies. Is there any existing code that will assemble source files using this, or some other, technique?

This is a very similar idea to the tools used to assemble and optimise JavaScript files before they are sent to the browser, where the latency of multiple HTTP requests hurts performance. In this Python case, it's the latency of importing hundreds of Python source files at startup which hurts.

A: 

I think that, due to the precompilation of Python files and some system caching, the speed-up that you'll eventually get won't be measurable.
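
One quick way to check is to time the import directly (a rough sketch; foo here stands for whatever concatenated module the tool produces):

import time

start = time.time()
import foo  # the module under test
print 'import took %.4f seconds' % (time.time() - start)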

Simon
did you mean "won't be measurable"?
Bryan Oakley
Precompilation to .pyc files is not available. Caching is not as good as I would hope in this particular situation. I've clarified this in the question.
Your Mom's Mom
Yes, of course "won't be measurable" :)
Simon
+3  A: 

If this is on Google App Engine, as the tags indicate, make sure you are using this idiom:

def main():
    # do stuff: handle the request and write the response
    pass

if __name__ == '__main__':
    main()

GAE doesn't restart your app on every request unless the .py has changed; it just runs main() again, so the module-level import work is only done once.

This trick lets you write CGI-style apps without the startup performance hit.

AppCaching

If a handler script provides a main() routine, the runtime environment also caches the script. Otherwise, the handler script is loaded for every request.

gnibbler
Thanks, but I already do this. Note that App Engine often only caches for a second or so, giving 14,000 or so opportunities for a cache miss every day, which is exacerbated when using many modules with complex import dependencies. I'm really interested in maximum startup performance.
Your Mom's Mom
A: 

Doing this is unlikely to yield any performance benefits. You're still importing the same amount of Python code, just in fewer modules - and you're sacrificing all modularity for it.

A better approach would be to modify your code and/or libraries to only import things when needed, so that a minimum of required code is loaded for each request.
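
A minimal sketch of the deferred-import pattern (the handler name and the imported module here are just placeholder examples):

def render_report(data):
    # Deferred import: xml.dom.minidom is only loaded (and then cached
    # in sys.modules) the first time this function actually runs, so
    # requests that never render a report don't pay for the import.
    import xml.dom.minidom
    return xml.dom.minidom.parseString(data).toprettyxml()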

Nick Johnson
I am not sacrificing all modularity, because I am proposing to process my source code with a tool prior to deployment, very much in the way that modern JavaScript tools combine, split and optimise JavaScript prior to deployment to the actual browser. It's a fine idea to say "write your code to only import what you need", but that precludes using the 300 odd source files in core Django, not including contrib modules, external libraries, and anything I might write. Besides, you often do want to import a lot of code because you actually do use it.
Your Mom's Mom
Nick, you work at Google; can you describe the process of loading and deploying user code on App Engine, please? It's not merely the cost of disk seeks as multiple modules are loaded, as in a conventional setup, because all user code is not deployed to all instances simultaneously. How is it actually done? Are all files deployed together, one at a time, or in batches? How does the custom zip code work? When you load one module from a zip, are all the modules within it loaded, or does that happen on demand? etc...
Your Mom's Mom
If having '300 odd' source files is a problem for doing on-demand importing, it's equally problematic if you want to concatenate them all - either would require substantial modifications. The problem with using a large framework like Django is that, as you've observed, it takes a long time to import. I doubt you're using more than 20% or so of the modules for any given request, though.
Nick Johnson
Not sure where you get the idea that code is deployed 'on demand' - you can use the os.listdir etc functions to see that this isn't the case. Likewise, you can examine the zipimport code yourself to see how it works. Both are fairly straightforward - and concatenating all the source is an 'optimisation' that is unlikely to help at all.
Nick Johnson
"Not sure where you get the idea that code is deployed 'on demand' - you can use the os.listdir etc functions to see that this isn't the case."Are you saying that 10's of thousands of apps written by 10's of thousands of developers are simultaneously and persistently deployed to the normal block size, ext2/3/4 file systems on the local hard disks of each of the thousands of servers in the app engine cluster? ie. that the app engine production environment is substantially similar to the average server setup? I don't see how this is possible at this scale...
Your Mom's Mom
"If having '300 odd' source files is a problem for doing on-demand importing, it's equally problematic if you want to concatenate them all..."This is a question about batching. It is more efficient to db.Get([list of keys]) than to: for key in list_of_keys: db.Get(key); for example.
Your Mom's Mom
Your entire app is deployed to the servers it's running on. Your optimisation is based on a premise that doesn't apply.
Nick Johnson
Great. What were the recent changes made to the production servers where, at one point, importing a single module caused 30-second timeouts, and then later an optimisation was added and importing became faster, even faster than before the 30-second timeouts? If it really is such a vanilla setup, where is the scope for such wild variations in import performance?
Your Mom's Mom
It was a bug. I can't go into detail about what the cause was - but it's now fixed.
Nick Johnson
Was it a bug in Linux stat() or read()? Or was it a bug in the unique App Engine shim between the Python interpreter and the kernel? Whatever - it's a secret we're not allowed to know. But it shows that the App Engine environment is not a typical server environment, and that it has different performance characteristics, one of which is that app startup continues to be poor, even with the recent bug fixed.
Your Mom's Mom
There are many other things that could result in slow import performance besides the ones you list. The occurrence of the bug in no way means that some unique component of the infrastructure was to blame.
Nick Johnson
So was the bug in some part of the App Engine provisioning/module-loading/python-fs-access code, or was it in some completely unrelated piece of the infrastructure which just so happened to have a random knock-on effect? I don't think you will be giving away any secrets if you share the general area of the code involved.
Your Mom's Mom
The bug was unrelated to importing specifically - it just happens that the affected apps spend most of their time in imports, and so are disproportionately likely to be hit by it there.
Nick Johnson
A: 

Leaving aside the question of whether or not this technique would actually speed things up in your environment, let's say you are right; here is what I would do.

I would make a list of all my modules, e.g. my_files = ['foo', 'bar', 'baz'].

I would then use the os.path utilities to read all the lines in all the files under the source directory and write them into a new file, filtering out all import foo|bar|baz lines, since all the code now lives within a single file.

Of course, finally appending the main() from __init__.py (if there is one) at the tail of the file.
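
A rough sketch of what that script might look like (the module list, source directory, and output filename are all assumptions):

import os
import re

my_files = ['foo', 'bar', 'baz']   # modules to merge, in dependency order
src_dir = 'src'                    # assumed location of the source files

# Matches imports of the merged modules, e.g. "import foo" or "from bar import x".
internal_import = re.compile(r'^\s*(?:import|from)\s+(?:%s)\b' % '|'.join(my_files))

out = open('combined.py', 'w')
for name in my_files:
    for line in open(os.path.join(src_dir, name + '.py')):
        # Drop imports of modules that now live in this same file.
        if internal_import.match(line):
            continue
        out.write(line)
    out.write('\n')
# Finally, append the main() from __init__.py here, if there is one.
out.close()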

Tzury Bar Yochay