Provided that we know the whole file will fit in memory and we can afford it, what are the drawbacks (if any) or limitations (if any) of loading an entire file (possibly a binary file) into a Python variable? If this is technically possible, should it be avoided, and why?

Regarding file size concerns, to what maximum size should this solution be limited? And why?

The actual loading code could be the one proposed in this Stack Overflow entry.

Sample code is:

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

content = file_get_contents('/bin/kill')

... code manipulating 'content' ...

[EDIT] Code manipulation that comes to mind (but may not be applicable) is standard list/string operations (square-bracket indexing/slicing, '+' concatenation) or string operations ('len', the 'in' operator, 'count', 'endswith'/'startswith', 'split', 'translate', ...).
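
For instance, a sketch of such manipulations using the helper above, under Python 2 semantics as in the snippet (on Python 3 the file would need to be opened in 'rb' mode and these would become bytes operations):

content = file_get_contents('/bin/kill')

size = len(content)                     # len
has_elf = 'ELF' in content              # 'in' operator
n = content.count('E')                  # count
is_elf = content.startswith('\x7fELF')  # startswith
head = content[:4]                      # square-bracket slicing
combined = content[:2] + content[-2:]   # '+' concatenation
lines = content.split('\n')             # split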

+9  A: 
  • Yes, you can
  • The only drawback is memory usage, and possibly also speed if the file is big.
  • File size should be limited to how much space you have in memory.

In general, there are better ways to do it, but for one-off scripts where you know memory is not an issue, sure.
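
One such better way for big files is to process them in bounded-size chunks; a minimal sketch (the chunk size and the byte-counting task are illustrative assumptions, not from the answer):

def count_bytes(filename, chunk_size=64 * 1024):
    # Read incrementally so memory use stays bounded regardless of file size.
    total = 0
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total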

Lennart Regebro
+3  A: 

The sole issue you can run into is memory consumption: Strings in Python are immutable. So when you need to change a byte, you need to copy the old string:

new = old[0:pos] + newByte + old[pos+1:]

This needs up to three times the memory of 'old'.

Instead of a string, you can use an array. Arrays offer much better performance if you need to modify the contents, and you can create one easily from a string.
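
A minimal sketch of that approach, assuming Python's built-in bytearray (one natural reading of "array" here; array.array('b', ...) from the array module behaves similarly):

with open('/bin/kill', 'rb') as f:
    data = bytearray(f.read())   # mutable copy of the file's bytes

data[0] = 0x7f                   # change a single byte in place, no copying

result = bytes(data)             # back to an immutable bytes object if needed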

Aaron Digulla
+3  A: 
with open(filename) as f:

This only works on Python 2.x on Unix. It won't do what you expect on Python 3.x or on Windows, as both draw a strong distinction between text and binary files. It's better to specify explicitly that the file is binary, like this:

with open(filename, 'rb') as f:

This turns off the OS's CR/LF conversion on Windows, and forces Python 3.x to return a bytes object rather than a Unicode string.
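
A quick illustration of the difference on Python 3 (the file name is hypothetical):

# Text mode: decodes to str and translates newlines.
with open('example.bin') as f:    # 'r' is the default mode
    text = f.read()               # str; may raise UnicodeDecodeError

# Binary mode: returns the raw bytes, untouched.
with open('example.bin', 'rb') as f:
    data = f.read()               # bytes, exactly as stored on disk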

As for the rest of your question, I agree with Lennart Regebro's (unedited) answer.

user9876
A: 

Yes, you can, provided the file is small enough.

It is even quite Pythonic to go further and convert the result of read() into a container/iterable type, with, say, string.split(), and then use functional-programming features to keep treating the file "at once".
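
A sketch of that style (the file name and the word-level processing are illustrative assumptions):

with open('words.txt') as f:
    words = f.read().split()      # whole file -> list of tokens, in one go

long_words = [w for w in words if len(w) > 7]   # treat the data functionally
total_chars = sum(map(len, words))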

mjv
+1  A: 

While you've gotten good responses, it seems nobody has answered this part of your question (as often happens when you ask many questions in a question;-)...:

Regarding file size concerns, to what maximum size should this solution be limited? And why?

The most important consideration is how much physical RAM this specific Python process can actually use (its "working set") without unduly penalizing other aspects of the overall system's performance. If your working set exceeds physical RAM, you'll be paging and swapping to disk, and performance can degrade rapidly (up to a state known as "thrashing", where essentially all available cycles go to moving pages in and out, and a negligible amount of actual work gets done).

Out of that total, a reasonably modest amount (say a few MB at most, in general) is probably going to be taken up by executable code (Python's own executable files, DLLs or .so's), bytecode, and general support data structures that are actively needed in memory; on a typical modern machine that isn't doing other important or urgent tasks, you can almost ignore this overhead compared to the gigabytes of RAM available overall (though the situation may be different on embedded systems, etc.).

All the rest is available for your data -- which includes this file you're reading into memory, as well as any other significant data structures. "Modifications" of the file's data can typically take (transiently) twice as much memory as the file's contents' size (if you're holding it in a string) -- more, of course, if you're keeping a copy of the old data as well as making new modified copies/versions.

So for "read-only" use on a typical modern 32-bit machine with, say, 2GB of RAM overall, reading into memory (say) 1.5 GB should be no problem; but it will have to be substantially less than 1 GB if you're doing "modifications" (and even less if you have other significant data structures in memory!). Of course, on a dedicated server with a 64-bit build of Python, a 64-bit OS, and 16 GB of RAM, the practical limits before very different -- roughly in proportion to the vastly different amount of available RAM in fact.

For example, the King James Bible text as downloadable here (unzipped) is about 4.4 MB; so, on a machine with 2 GB of RAM, you could keep about 400 slightly modified copies of it in memory (if nothing else is requesting memory), while on a machine with 16 (available and addressable) GB of RAM you could keep well over 3000 such copies.
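
As a quick sanity check on those copy counts (using the round figures above; all per-copy overhead ignored):

file_mb = 4.4                 # unzipped King James Bible text, per the answer
print(2 * 1024 / file_mb)     # ~465 copies could fit in 2 GB
print(16 * 1024 / file_mb)    # ~3723 copies could fit in 16 GB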

Alex Martelli