views: 618
answers: 6

I have an application which uses custom graph (tree-like) structures. The structures are not true trees; pretty much everything is connected to everything else. The amount of data is also large (millions of nodes may exist). Tree nodes can vary in type to make it more interesting (inheritance). I don't want to alter the data structures to accommodate the persistence storage.

I want to persist this data without too much extra work. I've googled some options for solving this problem, but couldn't find anything that fits my needs exactly. Possible options: serialization, databases with ORM (Hibernate?), JCR (JackRabbit?), anything else?

Performance is important, because it's a GUI-based "real-time" application (no batch processing) and there could be millions of graph nodes that need to be read and written between memory and the persistent data store.

Does anybody have experience with or ideas about storing this kind of data?

+1  A: 

Since you indicate that there is a large quantity of data, you probably want a mechanism that lets you bring the data in as needed. Serialization does not handle large quantities of data easily; to break it into manageable pieces you would need to use separate files on disk or store the pieces elsewhere. JCR (JackRabbit) is more of a content management system. Those work well for 'document'-type objects. It sounds like the individual pieces of the tree you want to store may be small, but together they are large. That is not the ideal case for a CMS.

The other option you mention, ORM, is probably your best option here. The JPA (Java Persistence API) is great for doing ORM in Java. You can write to the JPA spec and use Hibernate, EclipseLink or any other flavor-of-the-month provider. Those will work with whatever database you want. http://java.sun.com/javaee/5/docs/api/index.html?javax/persistence/package-summary.html

The other benefit to JPA is that you can use the lazy FetchType for loading tree dependencies. This way your application only needs to load the current set of pieces it is working on. As other things are needed, the JPA layer can retrieve them from the database as needed.
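As a rough sketch of what such a lazy mapping might look like (the entity, field names, and inheritance strategy here are invented for illustration, not taken from the question; this needs a JPA provider such as Hibernate on the classpath):

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;
import javax.persistence.OneToMany;

// Hypothetical node entity; subclasses can map the varying node types.
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
public class GraphNode {
    @Id
    @GeneratedValue
    private Long id;

    private String data;

    // LAZY: linked nodes are only fetched from the database
    // when the collection is first accessed.
    @OneToMany(fetch = FetchType.LAZY)
    private List<GraphNode> children;

    public List<GraphNode> getChildren() {
        // Triggers a lazy load if called inside an open persistence context.
        return children;
    }
}
```

Note that the lazy load only works while the persistence context is open, which matters for a long-lived GUI application.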

Chris Dail
A: 

An ORM, for example via the JPA API (Hibernate, EclipseLink, ...), will probably make it very quick to implement persistence. Raw performance when persisting the whole tree tends to be hard to achieve compared to plain JDBC, so if your only performance criterion is persisting the whole tree in one shot, that is probably not the best option.
On the other hand, if you also need to load the tree and synchronize changes to it, JPA offers those features built in with (after a bit of tweaking) better performance than many manual implementations.

Serialization in Java tends to be quite slow and produces loads of data. It is also quite brittle when you change classes in your app, and it is completely useless if you need to synchronize tree changes.
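To see the overhead concretely, here is a minimal sketch (class and field names invented for illustration) that serializes a small chain of nodes with default Java serialization and reports the byte count; even a trivial payload picks up per-object and class-descriptor overhead, and the whole graph must be written and read in one shot:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class SerializationSize {
    // A trivial serializable node; field names are invented for illustration.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;
        List<Node> children = new ArrayList<>();
    }

    // Builds a chain of 'length' nodes, serializes it, returns the byte count.
    static int serializedSize(int length) throws IOException {
        Node root = new Node();
        Node current = root;
        for (int i = 1; i < length; i++) {
            Node child = new Node();
            child.value = i;
            current.children.add(child);
            current = child;
        }
        // The whole graph is written in one shot; default serialization
        // offers no way to read it back piecemeal later.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(100) + " bytes for 100 nodes");
    }
}
```

The payload here is 100 ints (400 bytes raw), yet the stream is several times that; with millions of nodes the write time and stream size grow accordingly.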

In the same category as serialization, you can serialize to XML and persist it in an XML database (Oracle XDB). However, those are designed more for flexibility of storage and querying than for raw speed.

If time is not a concern, the very best way is always to involve a competent DBA, design an optimal data model, and refactor the tree accordingly.

vdr
+1  A: 

I had nearly the exact same problem and used Hibernate. We ran into a lot of problems late in the project because the view basically forced the entire graph into memory, even when using lazy fetch types. These tools were good early on, though, because we could quickly get a DB tier in place that gave us something (huzzah agile). Only when we went for performance improvements did we realize we needed to write a more intelligent persistence layer.

Is it possible to do some pre-processing on your data? If your problem is similar, there is a lot of value in transforming the data into an intermediate form that is closer to your view than the original domain and storing that in the DB as well. You can always link back to the original source using the lazy fetch type.

Basically we used a 4-tier system: Domain DB, ViewModel-DB hybrid (pre-processed layer), ViewModel, View

The advantage of this pre-processing step (especially with a realtime UI) is that you can page data into a ViewModel and render it nicely. So much of performance in a realtime app is sleight of hand: just stay responsive and show the user something nice while they wait. In our case we could show 3D box regions for data that was paging in, and data linked to still-loading data could show a visual indicator as well. The ViewModel-DB hybrid could also do nice things like LRU queues that fit our domain data. The biggest advantage, though, was removing the direct linking. Nodes had something similar to a URL to their linked data. When rendering, we could render the link, or render that there is a link and we are just paging it in at the moment.
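An LRU queue like the one mentioned can be sketched in plain Java with LinkedHashMap's access-order mode (the key/value types and capacity here are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Keeps the most recently accessed nodes in memory and silently evicts
// the least recently used entry once capacity is exceeded.
public class NodeCache extends LinkedHashMap<Long, String> {
    private final int capacity;

    public NodeCache(int capacity) {
        // accessOrder = true: iteration order follows access, not insertion.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
        // Called after each put; returning true evicts the eldest entry.
        return size() > capacity;
    }
}
```

On a cache miss the persistence layer would page the node in from the DB and put it here; the removeEldestEntry hook is also a natural place to write back evicted dirty nodes.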

Persistence at the DB level was JPA (Hibernate) to start with, but in the end the tables it generated for our inheritance structure were terrible and hard to maintain. We ended up wanting more control over the tables than JPA allowed (or at least easily allowed). This was a tough decision, as JPA did make a lot of the DB layer easy. Since JPA kept things nice and POJO, it didn't require mucking around with our datatypes. So that part was nice.

I hope there is something you can pull out of this meandering answer, and good luck :)

reccles
+3  A: 

As your data uses a graph data structure (basically: nodes and edges/relationships), a graph database would be a very good match. See my answer on The Next-gen Databases for some links. I'm part of the Neo4j open source graph database project, see this thread for some discussion of it. A big advantage of using Neo4j in a case like yours is that there's no trouble keeping track of persisting/activating objects or activation depth and the like. You probably wouldn't need to change the data structures in your application, but of course some extra code would be needed. The Design guide gives one example of how your code could interact with the database.

nawroth
A: 

Consider storing your nodes in a relational database; a suitable schema might be:

t1(node_id,child_id)
t2(node_id,data1,data2,..,datan)

Then use JDBC to access and modify the data. If you use proper indexes, it will perform rather well up to scales of around 100 million records. My gut feeling is to avoid generic object serialization if performance is really important, because with those solutions you lose some control over the performance characteristics of the code.
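A plain-Java sketch of the access pattern this schema supports (the maps stand in for the two tables; in the real system each lookup would be an indexed SQL SELECT over JDBC):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencySchema {
    // t1(node_id, child_id): edge table, indexed by node_id.
    static final Map<Long, List<Long>> edges = new HashMap<>();
    // t2(node_id, data1, ..., datan): node payload, keyed by node_id.
    static final Map<Long, String> payload = new HashMap<>();

    // Equivalent of: SELECT child_id FROM t1 WHERE node_id = ?
    static List<Long> childrenOf(long nodeId) {
        return edges.getOrDefault(nodeId, Collections.emptyList());
    }

    public static void main(String[] args) {
        payload.put(1L, "root");
        payload.put(2L, "left");
        payload.put(3L, "right");
        edges.put(1L, Arrays.asList(2L, 3L));

        // Walk one level: fetch only the children actually needed,
        // never the whole graph.
        for (long child : childrenOf(1L)) {
            System.out.println(payload.get(child));
        }
    }
}
```

Because edges live in their own table, cycles and shared subtrees (the "not a real tree" case) are represented naturally, and each traversal step touches only the rows it needs.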

If you need better performance, you can add a memcached layer.

Omry
A: 

I believe the solution to your problem is to use Terracotta as your persistent storage mechanism. I encourage you to read this excellent article about doing so.

It addresses your two main concerns: performance and transparency. It scales easily to large graphs while maintaining high performance, because its efficient sync mechanism only sends instance diffs across the network. It also persists your graph transparently because it works at the VM level, sparing you the impedance-mismatch problem you would face with the alternatives mentioned in other answers (ORM or OCM).

To be clear, Terracotta is not a persistence solution for every case. It's best employed when you need data available across machine reboots and you need it quickly. It's not a good solution when you need that data "archived", for example when you have requirements to access it long after the running system has stopped working with it. Think about orders coming into a web store: you probably want to keep those orders for years after they've been fulfilled. In such cases you can look at a hybrid approach, where select data needing to be archived is pulled out of the Terracotta cluster and stored in a traditional RDBMS.

For a more complete review of the pros & cons, be sure to read this StackOverflow post which covers more of the minutiae in making the choice.

rcampbell