I am currently developing a proof of concept for an alternative data store. I need to enhance a read-mostly clustered webapp, and I also want to free myself from the pain of the sometimes overly complex ORM+RDBMS solution.

Overall the idea is quite similar to a distributed cache with persistence (letting the cluster be the system of record, SoR), however:

  • I want to be able to retrieve any object along with its children by id (providing class and id). That is all I need to start off, as the main querying part is already handled by Lucene in my app.
  • I need a map of maps keyed by type (roughly tables in the relational world), each holding a distributed map of 'dehydrated' stored objects (flattening the object graph via reflection-based deep cloning)
  • a bin log (like Prevayler, for example) for
    • eventual recovery if whole cluster goes down
    • development (and ability to refactor code / change structure)
    • perhaps asynchronously processed for other purposes (reporting, whatever)
  • later on, try to integrate a statically-typed query mechanism, like LINQ, Jaque or H2's JaQu / see ODBs / Lucene (?)
  • it has to be transaction-aware (not sure "JTA type" though)
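
A minimal single-JVM sketch of the first three bullets, to make the idea concrete. Everything in it is an assumption for illustration: the names `TypedStore`, `Entry`, `Product`, `dehydrate` and `replay` are hypothetical, the maps are plain `ConcurrentHashMap`s standing in for whatever distributed map the cluster product provides, the journal lives in memory rather than on disk, and the dehydration is shallow (child references are stored as-is, not recursively flattened).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: none of these names come from Hazelcast or Terracotta.
class TypedStore {
    /** One journal record: the type and id written, plus the dehydrated row. */
    static final class Entry {
        final String type;
        final Object id;
        final Map<String, Object> row;
        Entry(String type, Object id, Map<String, Object> row) {
            this.type = type;
            this.id = id;
            this.row = row;
        }
    }

    /** Demo domain class, present only so the sketch is self-contained. */
    static final class Product {
        private final String name;
        private final int priceCents;
        Product(String name, int priceCents) {
            this.name = name;
            this.priceCents = priceCents;
        }
    }

    // Outer key: class name (~ table). Inner key: object id. Value: field name -> value.
    private final Map<String, Map<Object, Map<String, Object>>> types = new ConcurrentHashMap<>();
    // Append-only log of writes; a real bin log would be appended to disk instead.
    private final List<Entry> journal = new CopyOnWriteArrayList<>();

    /** Shallowly flattens the instance fields of obj into a plain map via reflection. */
    static Map<String, Object> dehydrate(Object obj) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Class<?> c = obj.getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                f.setAccessible(true);
                try {
                    row.put(f.getName(), f.get(obj));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return row;
    }

    /** Logs the write, then applies it to the map of maps. */
    void put(Object id, Object obj) {
        Entry e = new Entry(obj.getClass().getName(), id, dehydrate(obj));
        journal.add(e);
        apply(e);
    }

    private void apply(Entry e) {
        types.computeIfAbsent(e.type, k -> new ConcurrentHashMap<>()).put(e.id, e.row);
    }

    /** Retrieval by class and id; returns null when nothing is stored. */
    Map<String, Object> get(Class<?> type, Object id) {
        Map<Object, Map<String, Object>> table = types.get(type.getName());
        return table == null ? null : table.get(id);
    }

    /** Rebuilds a fresh store by replaying this store's journal in order. */
    TypedStore replay() {
        TypedStore copy = new TypedStore();
        for (Entry e : journal) {
            copy.journal.add(e);
            copy.apply(e);
        }
        return copy;
    }
}
```

With this sketch, `store.put(42L, new TypedStore.Product("lamp", 1999))` followed by `store.get(TypedStore.Product.class, 42L)` returns the flattened row, and `store.replay()` yields an equivalent store built only from the log. The log append and the map update are not atomic here, so the transaction requirement in the last bullet is left open.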

I'm planning to implement this idea with Hazelcast (I love its super-simple API) or Terracotta (which I have never used, though I'm aware of its 'sweet spot', medium-term data). If you will, my aim is to do more or less what Jonas once blogged about. With either of these, the stored data would roughly have to fit in the sum of the JVM heaps of the cluster.

This should be pretty simple to scale, and would avoid the relational impedance mismatch (i.e. the same as with an ODB) as well as JDBC + I/O overhead.

Do you know of other tools/frameworks, or combinations thereof, that already provide similar functionality and that I'm overlooking? Can you suggest other ways of tackling this 'getting rid of the DB'? What flaws do you already see in this idea? Concurrency-wise, would it make sense to consider Scala instead of Java?

How about non-relational data stores such as CouchDB, Neo4j, Hypertable, HBase?

A similar question was asked one month ago - but there was no concrete solution.

BTW I just stumbled upon the concept of Enterprise Data Fabric, which, to my surprise, describes a lot of these ideas.

+2  A: 

Definitely give Terracotta a try. It's free (unless you go Enterprise which has an SLA and support). It is a JVM-level cluster, so to speak, so you don't have the issues associated with sessions on multiple boxes behind disparate JK workers (assuming you're using this for a J2EE app).

I'm just rambling, so have a look here: http://en.wikipedia.org/wiki/Terracotta_Cluster

UPDATE: There are numerous bits of info on Terracotta on the web too, e.g. http://blog.terracottatech.com/2007/12/fud_of_the_week_terracotta_doe.html

UPDATE 2: A bit of background on my views: I work for a company with a fairly big audience. We have an enterprise MySQL setup running with a master and about 5 slaves (times 2, considering we have 2 channels, with 4 app servers per channel), using MySQL's JDBC replication driver (for which we've already submitted various patches). We use Spring 2.5/Hibernate 3 with Spring's declarative JTA transaction management, so read-only transactions go to the slaves. With the advent of numerous Ajax enhancements in a future version of our site, our DB servers' load has gone up: we create pricing summaries for thousands of products for all countries, taking into account duties/tax rules for all these countries (plus promotions and real-time auctions running all the time), so that the Ajax services have the latest prices in a blink. Terracotta takes the load off the DB and app servers by making these prices available to all app servers at the JVM layer, with all the JVMs across the boxes linked. So server A can update the prices every few minutes, and if an Ajax request hits server B, the prices are available immediately. I know there are people/companies out there with similar businesses who probably have better ideas and implementations, so I'm always open for discussion, but this is my two cents.

I get inspiration from the guys at Facebook too, for instance this very informative article: http://www.facebook.com/note.php?note_id=23844338919

They talk about memcached which you should also definitely check out.

opyate
+1  A: 

Interesting.

I have a view that we all develop a zoo which comprises all the abstraction layers we habitually use in our projects. And each abstraction layer is a completely different animal.

My goal is to minimize the amount of time spent on just care and feeding of the animals whenever it diverts me from solving the problem at hand - it's overhead - wasted resources. So the fewer, simpler abstraction layers we can get away with, the more productive we are.

I can usually do just fine with two beasties - OOP and RDBMS, coupled through nice, simple, minimal, hand-crafted DAL. For me, ORM is mostly overhead - one abstraction too many, and a pretty hungry one.

Don't discount the option of treating stored procedures as an abstraction tool, either. If you're real comfortable with SQL, it can be a useful resource for implementing a light-weight BL facade that means not needing to think about the ORM problem.

And this post suggests the emergence of alternatives to RDBMS for some requirements, anyway.

le dorfier
A: 

Thanks for your answers.

Actually, you talk about DBs which is something I want to completely take out of the picture.

The use case I'm targeting is a startup's small/medium-sized clustered webapp (boxes in a LAN, or in the cloud). It needs to retrieve objects at ~RAM-speed levels and scale fairly easily. As a side effect, one wouldn't have to think about DB server installations, impedance mismatch, JDBC, caches, polluting domain models with annotations, etc.

Again, what I want to accomplish is something like what is described here, and I would love to have some more feedback on ideas concerning the actual implementation (why use Terracotta instead of Hazelcast, whether to use serialization or deep cloning via reflection or something else, and also the major drawbacks of an approach like this, e.g. why you wouldn't swap your current ORM/DB setup for it).

It has to be super simple to integrate, so it'll feature a really neat Java API, improving code readability. No other software (DB, memcached) will be required.

frank06
@frank06: Your preconceptions about databases are incorrect. On a well-spec'ed, well-designed and well-tuned database, almost all accesses are served from 'hot' pages held in RAM, not from disk.
Mitch Wheat
@Mitch Wheat: Preconceptions? Just think "out-of-the-DB-box" for a while; that's what the post is about. Moreover, what you state is incorrect, otherwise caches and products like Terracotta would be pointless.
frank06
A: 

Try GigaSpaces. I think they have exactly what you require, and if I'm not mistaken there's a free version for startups.

Some concepts:

  • "Space" is some place where you can store and retrieve objects
  • Space can be backed by any JDBC-compliant DB, automatically (no code, only configuration)
  • Space can be started in your java process, so all accesses are at RAM speed
  • Space can be clustered/partitioned in any way you want (full mirror, partial, grid).
  • Space supports distributed or local transactions

Check their wiki (but only the "programmer's guide"; all the rest is marketing BS).

Yoni Roit
+2  A: 

As Neo4j is mentioned in the question, I'm chiming in with a few thoughts on using a graph database in this case. (I'm part of the Neo4j team)

  • retrieving children is trivial in a graph DB
  • there is a map implementation for Neo4j
  • as graphs are native to a graph DB, you could consider not flattening the object graph, and instead persisting data in nodes and edges/relationships (this gives you more flexibility in handling the data)
  • Neo4j is fully transactional

With the new DB technologies emerging today, there's really no need to stay with an RDBMS if your data isn't a good fit for the relational paradigm.

nawroth
Neo4j provides indexing through Lucene for super quick queries
+1  A: 

Seems to me Terracotta is a perfect fit for your requirements:

  • cluster a map to retrieve children via keys (e.g. clustered Map)
  • map of maps - no problem
  • no explicit bin log - but Terracotta already persists everything to disk so full cluster restart is already supported
  • integrated already to Compass, Hibernate Search, and Lucene for search
  • Transactions? Too slow. Use the cache as a datastore. With persistence you won't lose data when writing to (clustered) memory, and you can trickle the writes back to the DB.

In addition, Terracotta does the "reflection" thing you ask for, although it doesn't use reflection, as that is far too slow. It uses bytecode manipulation (BCM). Only changes are propagated over the network.

Hazelcast, by the way, requires serialization, so it will be slow and will not do well at all with a map-of-maps data structure (every put will result in a full deep-clone copy across the network), and it doesn't have any kind of persistence built in.
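
To illustrate the serialization concern raised above, here is a small sketch using plain JDK serialization. It is not Hazelcast's own serializer or API, and the class name `SerializationCopyDemo` is hypothetical. It only shows the general effect being described: when an entire inner map is the value that gets serialized, the bytes shipped on each put grow with the whole inner map, not with the one entry that changed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical demo class; uses only JDK serialization, no cluster product.
class SerializationCopyDemo {
    /** Serializes a value to bytes, as a by-value put across the network would need to. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the value back, producing an independent deep copy. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inner map standing in for one "table" stored as a single value.
        HashMap<String, Integer> products = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            products.put("p" + i, i);
        }
        int before = serialize(products).length;

        // Changing one entry still means re-serializing the whole inner map.
        products.put("one-more", 1);
        int after = serialize(products).length;

        System.out.println("bytes before: " + before + ", after one more entry: " + after);
    }
}
```

Whether this cost applies in practice depends on how the maps are laid out: storing each object under its own key in a distributed map, rather than nesting a whole map as one value, keeps each put proportional to a single object.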

Taylor Gautier
A: 

I have to admit I am VERY confused. The original post describes Terracotta exactly. It is OO (no ORM required). It writes everything to disk. It is transparent, and at the same time has several APIs / TIMs to do whatever you might want (write-behind to DB, master/worker, what have you).

In fact, the blogs quoted from willcode4beer and Jonas Boner are both suggesting you should just use Terracotta.