views:

856

answers:

10

I build a huge graph in a JVM (Scala) that I want to use repeatedly while tweaking algorithms. I'd rather not reload it from disk each time. Is there a way to have it sit in one JVM while connecting from another, where the algorithms are being developed?

A: 

Using RMI perhaps? Have one instance working as server and the rest as clients?

I think it would be much more complicated than reloading from disk.

OscarRyz
A: 

You can certainly create an interface onto it and expose it via (say) RMI.
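
As a rough illustration of what "an interface onto it" could look like, here is a minimal sketch. The names `GraphService`, `GraphServiceImpl`, `GraphServer`, the `neighbors` method, the binding name `"graph"`, and the `String`-keyed adjacency map are all assumptions made for the example, not anything from the question. Only the `java.rmi` calls are the standard API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical remote view of the graph; remote methods must declare RemoteException. */
interface GraphService extends Remote {
    List<String> neighbors(String node) throws RemoteException;
}

/** Wraps the in-memory graph held by the long-running JVM. */
final class GraphServiceImpl implements GraphService {
    private final Map<String, List<String>> adjacency;

    GraphServiceImpl(Map<String, List<String>> adjacency) {
        this.adjacency = adjacency;
    }

    @Override
    public List<String> neighbors(String node) {
        // Results travel to the client by value, so return a serializable copy.
        return new ArrayList<>(adjacency.getOrDefault(node, Collections.<String>emptyList()));
    }
}

final class GraphServer {
    // Strong references keep the exported objects from being garbage collected.
    private static GraphServiceImpl service;
    private static Registry registry;

    public static void main(String[] args) throws Exception {
        service = new GraphServiceImpl(loadGraph());
        // Port 0 lets RMI pick an anonymous port for the exported object.
        GraphService stub = (GraphService) UnicastRemoteObject.exportObject(service, 0);
        registry = LocateRegistry.createRegistry(1099); // 1099 is the conventional registry port
        registry.rebind("graph", stub);
    }

    /** Stand-in for the expensive load; replace with the real graph construction. */
    static Map<String, List<String>> loadGraph() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("a", Arrays.asList("b", "c"));
        return g;
    }
}
```

A client JVM would then obtain the stub with `LocateRegistry.getRegistry("localhost", 1099).lookup("graph")` and cast it to `GraphService`. Every call is a network round trip with serialized arguments and results, so a fine-grained method like `neighbors` is likely to be slow for algorithms that walk millions of edges.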

My initial thoughts on reading your post, however, are

  1. Just how big is this graph?
  2. Is it possible to optimise your loading procedure instead?

I know LinkedIn have a vast graph of people and connections that is held in memory all the time and that takes several hours to reload. But I figure that's a truly exceptional case.

Brian Agnew
+7  A: 

Save your graph to disk, then map it into memory with MappedByteBuffer. Both processes should use the same memory, which will be shared with the page cache.
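
A minimal sketch of the idea, assuming the graph can be flattened to integer node ids. The class name `MappedGraph` and the file layout (node count, then an offsets table, then a targets table) are hypothetical choices for this example. The point is that lookups read straight from the mapped buffer, with no deserialization step.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical flat layout, all 4-byte ints:
 * [nodeCount][offsets: nodeCount + 1 entries][targets: one entry per edge].
 */
final class MappedGraph {
    private final IntBuffer ints;
    private final int nodeCount;

    private MappedGraph(IntBuffer ints) {
        this.ints = ints;
        this.nodeCount = ints.get(0);
    }

    /** Writes adjacency lists (adjacency[i] = targets of node i) in the layout above. */
    static void write(Path file, int[][] adjacency) throws IOException {
        int n = adjacency.length;
        int edges = 0;
        for (int[] row : adjacency) edges += row.length;
        ByteBuffer buf = ByteBuffer.allocate(4 * (1 + n + 1 + edges));
        buf.putInt(n);
        int offset = 0;
        for (int[] row : adjacency) {
            buf.putInt(offset);          // where this node's targets start
            offset += row.length;
        }
        buf.putInt(offset);              // sentinel: total number of edges
        for (int[] row : adjacency) for (int target : row) buf.putInt(target);
        buf.flip();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (buf.hasRemaining()) ch.write(buf);
        }
    }

    /** Maps the file read-only; the mapping stays valid after the channel is closed. */
    static MappedGraph open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return new MappedGraph(mapped.asIntBuffer());
        }
    }

    int nodeCount() {
        return nodeCount;
    }

    int degree(int node) {
        return ints.get(1 + node + 1) - ints.get(1 + node);
    }

    /** Returns the i-th neighbor of node, read directly from the mapped file. */
    int neighbor(int node, int i) {
        int start = ints.get(1 + node);
        return ints.get(1 + nodeCount + 1 + start + i);
    }
}
```

A single `FileChannel.map` call is limited to `Integer.MAX_VALUE` bytes, so a file larger than about 2 GB would need several mappings. This sketch ignores that.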

bdonlan
But then I don't really need two JVMs, right? I have the graph serialized, yet loading from disk and deserializing takes about 5-7 minutes, although I suspect it's held in the Linux caches anyway. So how should one manage the memory to be shared by two processes here?
Alexy
Actually, I serialize the graph to load it from disk, so I wonder how should that serialization interact with the MappedByteBuffer?
Alexy
You would need to write it out in a format where you can reasonably use it directly out of the buffer - ie, without deserializing.
bdonlan
A: 

If it is expensive to build your graph, maybe you can serialize the object.

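// Uses java.io.ByteArrayOutputStream and java.io.ObjectOutputStream
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bos);
out.writeObject(graph);   // the graph's class must implement java.io.Serializable
out.flush();
byte[] b = bos.toByteArray();
// You can use a FileOutputStream instead of a ByteArrayOutputStream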

Then you can rebuild your object from the serialized bytes:

ByteArrayInputStream inputBuffer = new ByteArrayInputStream(b);
// try-with-resources closes the stream even if readObject() throws
try (ObjectInputStream inputStream = new ObjectInputStream(inputBuffer)) {
    Graph graph = (Graph) inputStream.readObject();
}

To read from a file, just replace the ByteArrayInputStream with a FileInputStream.

Dani Cricco
I serialize the graph already, but deserializing it takes 5-7 minutes.
Alexy
+3  A: 

Two JVMs sounds more complicated than it needs to be. Have you considered doing a kind of "hot deploy" setup, where your main program loads up the graph, displays the UI, and then asks for (or automatically looks for) a jar/class file to load that contains your actual algorithm code? That way your algorithm code would be running in the same JVM as your graph, but you wouldn't have to reload the graph just to reload a new algorithm implementation.

UPDATE to address OP's question in comment:

Here's how you could structure your code so that your algorithms would be swappable. It doesn't matter what the various algorithms do, so long as they are operating on the same input data. Just define an interface like the following, and have your graph algorithms implement it.

import java.util.Map;
public interface GraphAlgorithm {
  // Map<?, ?> is a placeholder; substitute your graph's actual key and value types
  void doStuff(Map<?, ?> myBigGraph);
}

If your algorithms are displaying results to some kind of widget, you could pass that in as well, or have doStuff() return some kind of results object.
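
One way the loading side might look, as a sketch rather than a finished design. `AlgorithmLoader` and `NodeCounter` are hypothetical names. `GraphAlgorithm` is repeated here only so the example compiles on its own, and `Map<?, ?>` stands in for the real graph type. The class loader and reflection calls are standard JDK API.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.Map;

/** Same shape as the interface above, repeated so this sketch is self-contained. */
interface GraphAlgorithm {
    void doStuff(Map<?, ?> myBigGraph);
}

/** Hypothetical helper that loads one algorithm implementation from a jar. */
final class AlgorithmLoader {
    /**
     * Creates a fresh class loader for the jar, so a rebuilt jar yields the new
     * version of the class on the next call. The parent loader is the one that
     * already holds GraphAlgorithm (and the graph), so both sides share that type.
     */
    static GraphAlgorithm load(Path jar, String className) throws Exception {
        URL[] urls = { jar.toUri().toURL() };
        // Deliberately left open: the algorithm may load further classes lazily.
        URLClassLoader loader = new URLClassLoader(urls, GraphAlgorithm.class.getClassLoader());
        Class<?> cls = Class.forName(className, true, loader);
        // Requires a no-argument constructor; a class loaded from a separate jar
        // should be public with a public constructor.
        return cls.asSubclass(GraphAlgorithm.class).getDeclaredConstructor().newInstance();
    }
}

/** Toy implementation; in practice this would live in the separately built jar. */
final class NodeCounter implements GraphAlgorithm {
    int lastCount = -1;

    @Override
    public void doStuff(Map<?, ?> myBigGraph) {
        lastCount = myBigGraph.size();
    }
}
```

The main program would keep the graph in a field and call something like `AlgorithmLoader.load(jarPath, "my.algos.Walker").doStuff(graph)` each time a new jar is supplied, where `my.algos.Walker` is a made-up class name. Class loaders delegate to the parent first, so the implementation classes must not also be on the main program's classpath, or the old version will keep being used.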

Peter Recore
This is interesting. What I want to do with the graph is not that fixed though; it's a few million nodes/edges and I want to walk it, flow through it, etc. Now which APIs would I use to dynamically apply methods from a jar, and how flexible is it?
Alexy
You'd use the Java reflection API - basically, you can load an arbitrary JAR or set of JARs, find a class in it, instantiate it, and invoke methods on it (or invoke static methods without instantiation). It's a bit heavyweight to actually do the call, but you'll be spending all your time inside there so it shouldn't be a problem.
bdonlan
OK -- so how do I set up a procedure whereby a running app checks regularly whether there's a new jar with algorithms to be run, loads it, and runs it?
Alexy
+5  A: 

Terracotta can help you with this. It allows you to share objects among several jvm instances.

Daniel Ribeiro
I've found Terracotta to be unsuited to *deep* collections (e.g. a map of maps) due to the way it decides to swap values in and out of memory.
oxbow_lakes
+1 for allowing VMs from different servers to participate, and Terracotta integration isn't too invasive.
Steve Reed
Interesting -- but indeed the graph is a Map of Maps.
Alexy
A: 

If the problem is just to dynamically load and run your code without name clashes, a custom class loader could be enough. For each new run, just cache all the class files in a new class loader.

A: 

Have you considered simply using a smaller amount of sample data for testing your algorithms?

Nick Lewis
That's doable too, but exploratory runs on all the data are also preferable when the graph is readily available.
Alexy
+1  A: 

Did you consider the OSGi platform? It lives in a single JVM, but will allow you to upgrade bundles with algorithms without restarting the platform. Thus you may have a long-running bundle with your huge data structures and short-lived algorithm bundles accessing the data.

Alexander Azarov
that's finally a good reason to take a look at OSGi
Alexy
A: 

Terracotta shares memory between many JVM instances, so you can easily cluster your system.

Firstthumb