Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:

public class Record {
  final private int field1;
  final private int field2;
  final private long field3;
  /* constructor & accessors here */
}

List<Record> records = new ArrayList<Record>();

If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) compares with an optimized approach for storage costs:

public class OptimizedRecordStore {
  final private int[] field1;
  final private int[] field2;
  final private long[] field3;

  Record getRecord(int i) { return new Record(field1[i],field2[i],field3[i]); }
  /* constructor and other accessors & methods */
}

edit:

  • assume the # of records is something that is changed infrequently or never
  • I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
  • obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
  • kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records: e.g. if field1 and field2 are "time" and "position", it's important for me to get those values as arrays for use with MATLAB, so I can graph/analyze them efficiently (see the sketch below).
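
Here is a rough sketch of how the second approach could expose whole columns for that kind of analysis; the constructor and the column-accessor names are just illustrative, not part of the code above:

public class OptimizedRecordStore {
  final private int[] field1;   // e.g. "time"
  final private int[] field2;   // e.g. "position"
  final private long[] field3;

  // Assumed constructor; the original only hints at one in a comment.
  public OptimizedRecordStore(int[] field1, int[] field2, long[] field3) {
    this.field1 = field1;
    this.field2 = field2;
    this.field3 = field3;
  }

  // Hypothetical column accessors: hand an entire field to analysis code in
  // one call. clone() returns a defensive copy so the final arrays stay
  // encapsulated; drop the clone() if callers can be trusted with the originals.
  public int[] getField1Column()  { return field1.clone(); }
  public int[] getField2Column()  { return field2.clone(); }
  public long[] getField3Column() { return field3.clone(); }
}
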
+2  A: 

Notice that the second approach might have a negative impact on caching behaviour. If you want to access a single record at a time, you'd better have that record not scattered all over the place.

Also, the only memory you win with the second approach is (possibly) due to member alignment and not having to allocate a separate object per record. Otherwise, they have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO.

EFraim
Not if you are doing operations on just one field.
fortran
why the same memory use asymptotically? for the first approach, one record = 16 bytes + some object overhead for each record + some overhead for the ArrayList. For the second approach, it's 16 bytes * the number of records + some overhead for the OptimizedRecordStore. If object overhead is 8 bytes then the 1st approach has roughly 50% more memory use... maybe that's OK but I'd like to figure out what it is.
Jason S
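
One rough way to pin these numbers down empirically is to populate each variant and compare used heap before and after. A minimal sketch, assuming the Record class above has the obvious (int, int, long) constructor; Runtime-based measurements are approximate and vary by JVM and GC behaviour:

import java.util.ArrayList;
import java.util.List;

public class StorageCostEstimate {
  // Approximate used heap after hinting the collector to run (gc() is only a hint).
  static long usedHeap() {
    System.gc();
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    final int n = 1000000;

    long before = usedHeap();
    List<Record> records = new ArrayList<Record>(n);
    for (int i = 0; i < n; i++) {
      records.add(new Record(i, i, i));   // assumes a Record(int, int, long) constructor
    }
    long listBytes = usedHeap() - before;

    before = usedHeap();
    int[] field1 = new int[n];
    int[] field2 = new int[n];
    long[] field3 = new long[n];
    long arrayBytes = usedHeap() - before;

    System.out.println("ArrayList<Record>: ~" + listBytes / n + " bytes/record");
    System.out.println("three flat arrays: ~" + arrayBytes / n + " bytes/record");

    // Reference the data afterwards so it stays reachable during measurement.
    System.out.println(records.size() + field1.length + field2.length + field3.length);
  }
}
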
+7  A: 

The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects, and it will do so on every access. That could lead to a lot of garbage (from a GC point of view). Not good.
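
To make the resizing point concrete, here is a rough sketch of what a growable version of the second approach would have to do on every append; the addRecord method and doubling policy are illustrative, not part of the original (final-array) code:

import java.util.Arrays;

public class GrowableRecordStore {
  private int[] field1 = new int[16];
  private int[] field2 = new int[16];
  private long[] field3 = new long[16];
  private int size = 0;

  public void addRecord(int f1, int f2, long f3) {
    if (size == field1.length) {
      int newCapacity = field1.length * 2;          // assumed doubling policy
      field1 = Arrays.copyOf(field1, newCapacity);  // three array copies where an
      field2 = Arrays.copyOf(field2, newCapacity);  // ArrayList.add would do one
      field3 = Arrays.copyOf(field3, newCapacity);
    }
    field1[size] = f1;
    field2[size] = f2;
    field3[size] = f3;
    size++;
  }
}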

Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields, or three arrays. Which one more accurately depicts what you're modeling? By this I mean: when you insert or delete an item, are you operating on just one of the three arrays or on all three as a block?

I suspect it's the latter, in which case the former (the single Record class) makes far more sense.

If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate: a SortedSet, a Map, or a SortedMap.

cletus
cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs.
Jason S
@Jason: I told you everything you need to know about optimization in this case. The first version will resize one array (that's ultimately what an ArrayList is). The second version resizes three arrays AND creates loads of temporary objects. And it does so with no apparent benefits (that I can see). You need to look no further than that.
cletus
@Jason S - you should ignore optimization until you've actually profiled your application and discovered what the *real* problems are. Otherwise you are probably wasting your time and reducing the adaptability and simplicity of your code.
whaley
@whaley: if you replace "you should ignore optimization" with "you should ignore optimization in most cases" I agree with you. Like I said, I need to get some intuitive sense of the cost of things.
Jason S
+1  A: 

I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to the values? What is the scenario behind your question?

Edit: You could also use a common long[][] matrix. I don't know how you pass the columns to MATLAB, but I guess you don't gain much speed with column-based storage; more likely you lose speed in the Java computation.
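
A rough sketch of the long[][] idea (the class and method names are illustrative); note that the two int fields get widened to long, so it trades some memory for a uniform layout:

public class MatrixRecordStore {
  private final long[][] data;   // data[field][recordIndex]

  public MatrixRecordStore(int recordCount) {
    data = new long[3][recordCount];
  }

  public void setRecord(int i, int field1, int field2, long field3) {
    data[0][i] = field1;
    data[1][i] = field2;
    data[2][i] = field3;
  }

  // Returns a copy of one field across all records, e.g. for handing to MATLAB.
  public long[] getColumn(int field) {
    return data[field].clone();
  }
}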

kd304
+1  A: 

Because you are making the int[] fields final, you are stuck with just the one initialization of each array and that is it. Thus, if you wanted 10^6 field1 values, Java would need to set aside that much memory for each of those int[] arrays, because you cannot resize them. With an ArrayList, if you do not know the number of records beforehand and will potentially be removing records, you save a lot of space upfront and then later on as well when you go to remove records.

AlbertoPL
+2  A: 

Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimises the number of objects floating around in the system: instead of 1,000,000 objects, you have only 3. I was able to do a bit of FFT analysis of real-time sound data using the C style, and it was far too slow using objects.

Greg Reynolds
+1  A: 

How are you going to access the data? If accesses to the fields are always coupled, then use the first option; if you are going to process each field on its own, then the second option is better.

See this article on Wikipedia: Parallel array.

A good example of when it's more convenient to have separate arrays is a simulation where the numerical data is packed together in one array, while other attributes like name, colour, etc., which are accessed only for presentation of the data, are kept in another array.
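
A small illustrative sketch of that kind of split (the names are made up, not from the question): packed numeric arrays for the hot loop, a separate array for presentation-only attributes:

public class ParticleSystem {
  private final double[] x;      // numeric data, scanned every step
  private final double[] y;
  private final String[] name;   // presentation-only attribute

  public ParticleSystem(int count) {
    x = new double[count];
    y = new double[count];
    name = new String[count];
  }

  public void step(double dx, double dy) {
    for (int i = 0; i < x.length; i++) {   // tight loop touches only the numeric arrays
      x[i] += dx;
      y[i] += dy;
    }
  }

  public String describe(int i) {
    return name[i] + " at (" + x[i] + ", " + y[i] + ")";
  }
}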

fortran
+1  A: 

I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.

First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.

Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.

Some other possibilities to consider:

An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
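
For example, a sketch of an access method written against the underlying array directly (the method name is illustrative), so no Record objects are created while scanning:

// Inside OptimizedRecordStore: operate on the long[] directly instead of
// materialising a Record per element.
double averageField3() {
  long sum = 0;
  for (int i = 0; i < field3.length; i++) {
    sum += field3[i];
  }
  return field3.length == 0 ? 0.0 : (double) sum / field3.length;
}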

A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to be accessed, but all data is needed "just in case".
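
A sketch of that hybrid (class and field names are illustrative, and it assumes the Record(int, int, long) constructor from the question):

import java.util.HashMap;
import java.util.Map;

public class CachingRecordStore {
  private final int[] field1;
  private final int[] field2;
  private final long[] field3;
  private final Map<Integer, Record> cache = new HashMap<Integer, Record>();

  public CachingRecordStore(int[] field1, int[] field2, long[] field3) {
    this.field1 = field1;
    this.field2 = field2;
    this.field3 = field3;
  }

  // Builds each Record at most once, the first time its index is requested.
  public Record getRecord(int i) {
    Record r = cache.get(i);
    if (r == null) {
      r = new Record(field1[i], field2[i], field3[i]);
      cache.put(i, r);
    }
    return r;
  }
}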

mikera
+2  A: 

(Not a direct answer, but one that I think should be given)

From your comment,

"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"

You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization; in fact, in ~20 years of professional coding I have cared about optimization a total of two times:

  1. Writing a program that had its primary purpose to be faster than another product
  2. Writing a smartphone app with the intention of reducing the amount of data sent between the client and server

In the first case I wrote some code, then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works, and you learn a lot about low-level decisions. I did not, however, allow it to impact the higher-level classes.

In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.

If your code is clear then it will be easier to speed up once you find out it is slow. As cletus said in his answer, you are resizing one array vs. three... one will be faster than three. From a higher point of view, the one is simpler to understand than the three, and thus more likely to be correct.

Personally I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer, I can find out where the system is slow and replace those parts with faster implementations.

TofuBeer
+1 -- but I disagree with "You should always ignore optimization" -- particularly "always". I agree with the rest, but please understand that experienced programmers make tons of unconscious decisions based on that experience, decisions that we less-experienced programmers have to muddle our way through until we learn. I've had several applications in the past 12 months that I've had to optimize because otherwise they wouldn't work at all -- I'm dealing with a system that has to process hundreds of KB per second, and every time I take the "don't optimize until later" approach I end up having to redesign my code.
Jason S
I agree with the unconscious part... but I disagree that it should be a conscious effort (most of the time) to find out what the fastest way is before the code is "done". I recently did a new system where I needed things to finish within 5 minutes; I started at about 12 minutes and it is now down to about 3.5 minutes. In the process I re-wrote 100% of the code bit by bit until things were fast. Each iteration also made things much better. In the end I wound up with something much different than I expected, and much nicer as well.
TofuBeer
One other thought: you would not go out of your way to make things slow, for example by choosing a List instead of a Set when you don't want duplicates (and thus having to iterate over the List before inserting), but that is different than worrying about data representation etc... If you always make the choice for simpler code and then find out where it is slow, you will have an easier time speeding it up where it needs to be sped up.
TofuBeer