I have a large set of data (a data cube of 250,000 x 1,000 doubles, about a 4 GB file) and I want to manipulate it using a previous set of OOP classes I have written in Python. The data set is already so large that to read it into memory I have to split it at least in half, so computing overhead is a concern. My OOP classes create new objects to handle the data (in this case I will need 250,000 new objects, each of which is an array of 1,000 doubles). What is the overhead, in terms of memory and computation, of creating objects in a generic OOP language? In Python? What about in C++?

Yes, I realize I could make a new class that is an array. But 1) I already have these classes finished and 2) I put each object that I create back into an array for access later anyway. The question is pedagogical.

*Update: I want to be efficient with time, both mine and the computer's. I don't want to rewrite a program I already have if I don't have to, and spending time optimizing the code wastes my time; I don't care so much about wasting the computer's time. I actually do have a 64-bit machine with 4 GB of RAM. The data is an image and I need to apply several filters to each pixel.*

A: 

Impossible to answer without knowing the shape of the data and the structure that you've designed to contain it.

Greg Hurlman
Could you elucidate more here?
Alex
Your question is his question :)
e-satis
+2  A: 

I wouldn't consider it fair to blame any shortcomings of your design on OOP. Like any other programming platform out there, OO can be used for both good and less-than-optimal designs. Rarely will this be the fault of the programming model itself.

But to try to answer your question: allocating 250,000 new objects requires some overhead in every OO language that I'm aware of, so if you can get away with streaming the data through the same instance, you're probably better off.
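As a rough sketch of what streaming through one reusable buffer could look like (assuming the file is raw binary doubles, 1,000 per row; `process_row` is a hypothetical stand-in for the OP's existing filter code):

```python
# Minimal sketch: stream the cube one row at a time through a single
# reusable buffer instead of allocating 250,000 separate objects.
# Assumes raw, native-endian binary doubles, 1,000 per row.
ROW_LEN = 1000            # doubles per row
ROW_BYTES = ROW_LEN * 8   # 8 bytes per double

def stream_rows(path, process_row):
    buf = bytearray(ROW_BYTES)           # the one buffer we keep reusing
    view = memoryview(buf)
    with open(path, 'rb') as f:
        while f.readinto(buf) == ROW_BYTES:
            process_row(view.cast('d'))  # expose the bytes as doubles
```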

Brian Rasmussen
It doesn't even have to be OOP - you'd have the same overhead on *any* platform if you allocated memory for each member of a set this large.
GalacticCowboy
+3  A: 

You'd have similar issues with procedural/functional programming languages. How do you store that much data in memory? A struct or array wouldn't work either.

You need to take special steps to manage this scale of data.

BTW: I wouldn't use this as a reason to pick either an OO language or not.

Stephane Grenier
A: 

The "overhead" depends largely on the platform and the implementation you chose.

Now if you have a memory problem reading millions of data from a multiple Gb file, you're having an algorithm problem where the memory consumption of objects is definitely not the biggest concern, the concern yould be more about how you do fetch, process and store the data.

Loki
A: 

As the other posters have stated, I do not believe objects are going to add a significant amount of overhead to your process. The containing array will need to store a pointer to each object, but the doubles themselves will be taking 99% of your program's memory.

Can you partition this data into much smaller subsets? What is the task that you are trying to accomplish? I would be interested in seeing what you need all the data in memory for. Perhaps you can just serialize it, or use something like lazy evaluation, as in Haskell.

Please post a follow up so we can understand your problem domain better.
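In the meantime, here is a rough sketch of lazy, chunked reading in Python, assuming the cube is stored as raw binary doubles (the file name and chunk size are made up for illustration):

```python
import array

ROW_LEN = 1000            # doubles per row
ROWS_PER_CHUNK = 5000     # tune so one chunk fits comfortably in RAM

def chunks(path, rows_per_chunk=ROWS_PER_CHUNK):
    """Lazily yield slices of the cube, one chunk of rows at a time."""
    chunk_bytes = rows_per_chunk * ROW_LEN * 8
    with open(path, 'rb') as f:
        while True:
            raw = f.read(chunk_bytes)
            if not raw:
                break
            yield array.array('d', raw)   # at most one chunk in memory
```

Iterating with `for chunk in chunks('cube.bin')` then never holds more than one chunk's worth of doubles at a time, and the rest of the processing code doesn't need to know the file is 4 GB.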

semmons99
+2  A: 

See http://code.activestate.com/recipes/546530/

That recipe estimates the approximate size of Python objects.

The OO size "penalty" is often offset by the ability to (a) simplify processing and (b) keep less stuff in memory in the first place.

There is no OO performance overhead. Zero. In C++, the class definitions are optimized out of existence, and all you have left is C. In Python -- like all dynamic languages -- the dynamic programming environment adds some run-time lookups. Mostly, these are direct hashes into dictionaries. It's slower than code where a compiler did all the resolving for you. However, it's still very fast with relatively low overhead.
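For a quick, rough feel for those sizes (numbers vary by Python version and build; `Row` is just an illustrative wrapper class, not part of the OP's code):

```python
import sys
import array

class Row(object):
    """Illustrative wrapper: one object holding 1,000 values."""
    def __init__(self, values):
        self.values = values

values = [float(i) for i in range(1000)]
row = Row(values)
packed = array.array('d', values)

print(sys.getsizeof(row))           # the instance itself: a few dozen bytes
print(sys.getsizeof(row.__dict__))  # its attribute dictionary
print(sys.getsizeof(values))        # the list of 1,000 pointers
print(sys.getsizeof(packed))        # about 8,000 bytes of packed doubles
# Note: getsizeof is shallow; the 1,000 individual float objects
# referenced by the list are not counted in the figures above.
```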

A bad algorithm in C can easily be slower than the right algorithm in Python.

S.Lott
A: 

I don't think the question is overhead coming from OO.

If we accept C++ as an OO language and remember that the C++ compiler is a preprocessor to C (at least it used to be, when I used C++), anything done in C++ is really done in C. C has very little overhead. So it would depend on the libraries.

I think any overhead would come from interpretation, managed execution or memory management. For those that have the tools and the know-how, it would be very easy to find out which is most efficient, C++ or Python.

I can't see where C++ would add much avoidable overhead. I don't know much about Python.

Guge
A: 

Compared to the size of your data set, the overhead of 250K objects is negligible.

I think you're on the wrong path; don't blame objects for that ;-)

Steven A. Lowe
A: 

Please define "manipulate". If you really want to manipulate 4 gigs of data why do you want to manipulate it by pulling it ALL into memory right away?

I mean, who needs 4 gig of RAM anyway? :)

Hace
+3  A: 

Slightly OT: the flyweight design pattern can be useful for minimising overheads when you're manipulating large datasets. Without knowing the details of your problem I'm not sure how applicable it is, but it's worth a look...
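For reference, a minimal Python sketch of the idea: instances with identical shared state are pooled and reused rather than re-created. Whether this maps onto the OP's pixel objects depends on how much of their state actually repeats; `FilterKernel` is just an invented example.

```python
class FilterKernel(object):
    """Flyweight: identical coefficient sets share one instance."""
    _pool = {}

    def __new__(cls, coefficients):
        key = tuple(coefficients)
        if key not in cls._pool:
            obj = super(FilterKernel, cls).__new__(cls)
            obj.coefficients = key
            cls._pool[key] = obj
        return cls._pool[key]

a = FilterKernel([1, 2, 1])
b = FilterKernel([1, 2, 1])
assert a is b   # both names refer to the same pooled object
```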

Dan Vinton
+1  A: 

Actual C++ OO memory overhead is one pointer (4-8 bytes, depending) per object with virtual methods. However, as mentioned in other answers, the default memory allocation overhead from dynamic allocation is likely to be significantly greater than this.

If you're doing things halfway reasonably, neither overhead is likely to be significant compared with a 1,000-element array of 8-byte doubles. If you're actually worried about allocation overhead, you can write your own allocator, but check first to see if it will actually buy you a significant improvement.

comingstorm
A: 

If you have to manipulate data sets this big on a regular basis, could you just get a 64-bit machine with bucket-loads of RAM? For various reasons, I've found myself working with fairly resource-hungry software (in this case SQL Server Analysis Services). Older 64-bit machines of this sort can take large amounts of RAM and feature CPUs that, while not cutting-edge, are still respectably fast.

I got some secondhand HP workstations and fitted them with several fast SCSI disks. In mid-2007 these machines, with 4 or 8 GB of RAM and 5x 10K or 15K SCSI disks, cost between £1,500 and £2,000. The disks were half the cost of the machines, and you may not need the I/O performance, so you probably won't need to spend this much. XW9300s of the sort that I bought can be purchased off eBay quite cheaply now; this posting of mine goes into various options for using eBay to get a high-spec 64-bit box on the cheap. You can get memory upgrades to 16 or 32 GB for these machines off eBay for a small fraction of the list price of the parts.

ConcernedOfTunbridgeWells
A: 

A friend of mine was a professor at MIT and a student asked him why his image analysis program was running so slow. How was it built? Every pixel was an object, and would send messages to its neighbors!

If I were you I'd try it in a throw-away program. My suspicion is, unless your classes are very carefully coded, you're going to find it spending a lot of time allocating, initializing, and de-allocating objects, and as Brian said, you might be able to spool the data through a set of re-used objects.

Edit: Excuse me, you said you are re-using objects, so that's good. In any case, once you get it running you could profile it or (if it were me) just read the call stack a few random times, and that will answer any questions about where the time goes.
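A minimal sketch of the profiling option, assuming a hypothetical `main()` entry point for the existing program:

```python
import cProfile
import pstats

def main():
    pass  # placeholder for the OP's actual processing loop

cProfile.run('main()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # top 20 by cumulative time
```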

Mike Dunlavey
A: 

Since you can split the data in half and operate on it, I'm assuming that you're working on each record individually? It sounds to me like you need to change your deserialiser to read one record at a time, manipulate it, and then write out the results.

Basically you need a string parser class that does a Peek() returning a char, knows how to skip whitespace, etc. Wrap a class around that which understands your data format, and you should be able to have it spit out one object at a time as it reads the file.
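A very rough Python sketch of that split, assuming a whitespace-separated text format (if the file is actually raw binary doubles, a fixed-size read per record is simpler still):

```python
class PeekableReader(object):
    """Wraps a file object and supports a one-character peek()."""
    def __init__(self, f):
        self._f = f
        self._peeked = None

    def peek(self):
        if self._peeked is None:
            self._peeked = self._f.read(1)   # '' at end of file
        return self._peeked

    def read_char(self):
        ch = self.peek()
        self._peeked = None
        return ch

    def skip_whitespace(self):
        while self.peek().isspace():
            self.read_char()

def read_record(reader, row_len=1000):
    """Parse one row of row_len doubles; returns None at end of file."""
    row = []
    while len(row) < row_len:
        reader.skip_whitespace()
        token = []
        while reader.peek() != '' and not reader.peek().isspace():
            token.append(reader.read_char())
        if not token:
            return None          # ran out of input
        row.append(float(''.join(token)))
    return row
```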

Travis