views:

290

answers:

6

Since RAM seems to be the new disk, and since that statement also means that access to memory is now considered slow similarly to how disk access has always been, I do want to maximize locality of reference in memory for high performance applications. For example, in a sorted index, I want adjacent values to be close (unlike say, in a hashtable), and I want the data the index is pointing to close by, too.

In C, I can whip up a data structure with a specialized memory manager, like the developers of the (immensely complex) Judy array did. With direct control over the pointers, they even went so far as to encode additional information in the pointer value itself. When working in Python, Java or C#, I am deliberately one (or more) level(s) of abstraction away from this type of solution and I'm entrusting the JIT compilers and optimizing runtimes with doing clever tricks on the low levels for me.

Still, I guess, even at this high level of abstraction, there are things that can be semantically considered "closer" and therefore are likely to be actually closer at the low levels. For example, I was wondering about the following (my guess in parentheses):

  • Can I expect an array to be an adjacent block of memory (yes)?
  • Are two integers in the same instance closer than two in different instances of the same class (probably)?
  • Does an object occupy a contigous region in memory (no)?
  • What's the difference between an array of objects with only two int fields and a single object with two int[] fields? (this example is probably Java specific)

I started wondering about these in a Java context, but my wondering has become more general, so I'd suggest to not treat this as a Java question.

+7  A: 
  • In .NET, elements of an array are certainly contiguous. In Java I'd expect them to be in most implementations, but it appears not to be guaranteed.
  • I think it's reasonable to assume that the memory used by an instance for fields is in a single block... but don't forget that some of those fields may be references to other objects.

For the Java array part, Sun's JNI documentation includes this comment, tucked away in a discussion about strings:

For example, the Java virtual machine may not store arrays contiguously.

For your last question, if you have two int[] then each of those arrays will be a contiguous block of memory, but they could be very "far apart" in memory. If you have an array of objects with two int fields, then each object could be a long way from each other, but the two integers within each object will be close together. Potentially more importantly, you'll end up taking a lot more memory with the "lots of objects" solution due to the per-object overhead. In .NET you could use a custom struct with two integers instead, and have an array of those - that would keep all the data in one big block.

I believe that in both Java and .NET, if you allocate a lot of smallish objects in quick succession within a single thread then those objects are likely to have good locality of reference. When the GC compacts a heap, this may improve - or it may potentially become worse, if a heap with

A B C D E

is compacted to

A D E B

(where C is collected) - suddenly A and B, which may have been "close" before, are far apart. I don't know whether this actually happens in any garbage collector (there are loads around!) but it's possible.

Basically in a managed environment you don't usually have as much control over locality of reference as you do in an unmanaged environment - you have to trust that the managed environment is sufficiently good at managing it, and that you'll have saved enough time by coding to a higher level platform to let you spend time optimising elsewhere.

Jon Skeet
Are you sure about Java arrays being contiguous?Although most VMs probably allocate Java arrays in a contiguous block of memory, I can't find any requirement for this in the language or the VM specification. If a contiguous block was guaranteed, the JNI functions for accessing arrays seem rather superfluous, as rather a pointer to the memory could be passed to the native function.
jarnbjo
Nice answer, but....NET allocates memory for an object just by grabbing a chunk of contiguous memory from the bottom (or top) of the managed heap. So an obujects member variables will always be adjacent in RAM. However, for a reference type, this means the pointer is adjacent. where it points to will depend on when the object it points to was constructed. An array is an object, so the same will apply to an array with the array elements being its members.
pipTheGeek
@pip: I thought I'd covered all of that, hadn't I? (See my second bullet.)
Jon Skeet
@jarnbjo: Will check...
Jon Skeet
@jarnbjo: You're right. Will edit.
Jon Skeet
@jarnbjo, @Jon: JesperE says below that in JNI you get ordinary C pointers to access arrays of primitives, which would imply they at least seem to be guaranteed to be contigous.
Hanno Fietz
@Hanno: Actually, I've just found evidence that they may *not* be contiguous.... editing.
Jon Skeet
Non-contiguous arrays are probably for supporting 286 segmented modes.
Tom Hawtin - tackline
@Hanno: I was a little bit short. JNI allows you to access arrays using C pointer, although you have to explicitly request it, allowing the JVM to copy the array for you and then copy it back when you release it.
JesperE
+3  A: 

First, your title is implying C#. "Managed code" is a term coined by Microsoft, if I'm not mistaken.

Java primitive arrays are guaranteed to be a continuous block of memory. If you have a

int[] array = new int[4];

you can from JNI (native C) get a int *p to point to the actual array. I think this goes for the Array* class of containers as well (ArrayList, ArrayBlockingQueue, etc).

Early implementations of the JVM had objects as contiuous struct, I think, but this cannot be assumed with newer JVMs. (JNI abstracts away this).

Two integers in the same object will as you say probably be "closer", but they may not be. This will probably vary even using the same JVM.

An object with two int fields is an object and I don't think any JVM makes any guarantee that the members will be "close". An int-array with two elements will very likely be backed by a 8 byte long array.

JesperE
jarnbjo and Jon Skeet in Jon's answer were discussing Java arrays and concluded that they are likely, but not guaranteed, to be contigous. Is there a difference between arrays of primitive versus reference types in JNI?
Hanno Fietz
"Managed code" is a term used to denote any language that runs in a managed environment - ie a JVM or the .NET framework. MS uses it more than most to distinguish between it and native coding for Windows, whereas Java doesn't need to make such distinctions.
gbjbaanb
I've never heard the term "managed code" outside the .NET realm. Before .NET came along, Java people talked about JVMs and bytecode, not managed code.
JesperE
@Hanno: no, I don't think JNI makes a difference between primitive and reference types in arrays. Reference types are treated as `jobject[]` in JNI, AFAIR.
JesperE
+2  A: 

With regards to arrays here is an excerpt from CLI (Common Language Infrastructure) specification:

Array elements shall be laid out within the array object in row-major order (i.e., the elements associated with the rightmost array dimension shall be laid out contiguously from lowest to highest index). The actual storage allocated for each array element can include platform-specific padding. (The size of this storage, in bytes, is returned by the sizeof instruction when it is applied to the type of that array’s elements.

Dzmitry Huba
+2  A: 

Good question! I think I would resort to writing extensions in C++ that handle memory in a more carefully managed way and just exposing enough of an interface to allow the rest of the application to manipulate the objects. If I was that concerned about performance I would probably resort to a C++ extension anyway.

Kylotan
Nice suggestion, too, especially because it would allow you to port your solution between multiple VM languages. But of course, that approach can easily involve a ton of work, and then it can be very handy to know a little more about what you can do inside your current environment.
Hanno Fietz
See also my comment on Nick Craig-Wood's post about Python - many high level languages will provide libraries that attempt to emulate some sort of low-level functionality and they are likely to provide similar benefits as long as you don't waste too much CPU time converting data going in and coming out.
Kylotan
A: 

If you need to optimise to that level then I suspect a VM based language is not for you ;)

wefwfwefwe
Well that's an easy answer, of course, but hardly an interesting one. Choosing a language for a specific project is a decision with multiple factors, performance being among them, but not always the most important one. And when you really *need* to optimize to that level is an equally complex decision. You can go a very long way on big iron, you can try to get more scalability and add more machines etc. But if you know what does / maybe does / certainly doesn't improve locality of reference in the language you happen to be using, that just gives you more building bricks for a solution.
Hanno Fietz
Whatever. If you want to micro manage memory then use a language where you manually manage memory.
wefwfwefwe
+2  A: 

I don't think anyone has talked about Python so I'll have a go

Can I expect an array to be an adjacent block of memory (yes)?

In python arrays are more like arrays of pointers in C. So the pointers will be adjacent, but the actual objects are unlikely to be.

Are two integers in the same instance closer than two in different instances of the same class (probably)?

Probably not for the same reason as above. The instance will only hold pointers to the objects which are the actual integers. Python doesn't have native int (like Java), only boxed Int (in Java-speak).

Does an object occupy a contigous region in memory (no)?

Probably not. However if you use the __slots__ optimisation then some parts of it will be contiguous!

What's the difference between an array of objects with only two int fields and a single object with two int[] fields? (this example is probably Java specific)

In python, in terms of memory locality, they are both pretty much the same! One will make an array of pointers to objects which will in turn contain two pointers to ints, the other will make two arrays of pointers to integers.

Nick Craig-Wood
Great, thanks! I think it would be cool to have a little information about several VM languages accumulate here.
Hanno Fietz
I think this is a bit misleading as Python doesn't have arrays natively, just lists, which would be an array of pointers to PyObjects as stated. However it also provides an array module (http://docs.python.org/library/array.html) and a struct module (http://docs.python.org/library/struct.html) each of which would give you more closely and predictably packed data.
Kylotan
@Kylotan you are right of course - I was attempting to translate the somewhat Java oriented nature of the question into normal Python usage. The array module would be worth a look most definitely if you wanted to get better memory locality.
Nick Craig-Wood