Hi All,

I am building a content-serving application composed of a cluster of two types of nodes: ContentServers and ContentBuilders.

The idea is to always serve fresh content. Content is fresh if it was built recently, i.e. now - Content.buildTime < MAX_AGE.

Requirements:

*ContentServers will only have to look up content and serve it (e.g. from a distributed cache or similar), with no waiting for anything to be built except on the first request for each item of Content.

*ContentBuilders should be load balanced, should rebuild Content just before it expires, and should only build content that is actually being requested. The built content should be quickly retrievable by all ContentServers.

What architecture should I use? I'm currently thinking of a distributed cache (EhCache, maybe) to hold the built content and a messaging queue (JMS/ActiveMQ, maybe) to relay the content requests to the builders, though I would consider any other options/suggestions. How can I be sure that the ContentBuilders will not build the same thing at the same time, and will only build content when it nears expiry?
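
Roughly what I'm imagining on the ContentServer side (the class, the content.rebuild queue name and MAX_AGE_MS are just placeholders; the builder side and the wait-for-first-build logic are left out):

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ContentServer {

        private static final long MAX_AGE_MS = 60000; // freshness window (placeholder value)

        private final Cache cache;
        private final Session session;
        private final MessageProducer rebuildQueue;

        public ContentServer() throws Exception {
            // Cache of built content, assumes a "content" cache configured in ehcache.xml.
            cache = CacheManager.create().getCache("content");

            // Queue the ContentBuilders listen on for rebuild requests.
            Connection conn =
                    new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
            conn.start();
            session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            rebuildQueue = session.createProducer(session.createQueue("content.rebuild"));
        }

        /** Serve cached content; ask the builders for a rebuild when it is missing or stale. */
        public String serve(String contentId) throws Exception {
            Element element = cache.get(contentId);
            long now = System.currentTimeMillis();

            if (element == null || now - element.getCreationTime() >= MAX_AGE_MS) {
                // Missing or expired: request a build. On a first request the server would
                // have to wait for the cache to be populated -- omitted here.
                rebuildQueue.send(session.createTextMessage(contentId));
                return null;
            }
            return (String) element.getObjectValue();
        }
    }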

Thanks.

A: 

If the content building can be parallelized (builder 1 does 1..1000, builder 2 does 1001..2000) then you could create a configuration file to pass this information. A ContentBuilder will be responsible for monitoring its area for expiration.
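A minimal sketch of that idea (the configuration keys, ContentRepository interface and refresh margin are made up for illustration): each builder is handed a fixed id range and simply scans it on a timer, so no two builders ever touch the same content.

    import java.util.Timer;
    import java.util.TimerTask;

    /**
     * A ContentBuilder that owns a fixed range of content ids (read from configuration,
     * e.g. builder.range.first=1, builder.range.last=1000) and periodically rebuilds
     * anything in its range that is about to expire.
     */
    public class RangeContentBuilder {

        private final int firstId;
        private final int lastId;
        private final long maxAgeMs;          // freshness window
        private final long refreshMarginMs;   // rebuild this long before expiry
        private final ContentRepository repository; // hypothetical access to the shared cache

        public RangeContentBuilder(int firstId, int lastId, long maxAgeMs,
                                   long refreshMarginMs, ContentRepository repository) {
            this.firstId = firstId;
            this.lastId = lastId;
            this.maxAgeMs = maxAgeMs;
            this.refreshMarginMs = refreshMarginMs;
            this.repository = repository;
        }

        /** Scan the owned range on a timer; no other builder owns these ids, so no duplicate builds. */
        public void start() {
            new Timer(true).schedule(new TimerTask() {
                @Override
                public void run() {
                    long now = System.currentTimeMillis();
                    for (int id = firstId; id <= lastId; id++) {
                        long builtAt = repository.buildTime(id);
                        if (now - builtAt >= maxAgeMs - refreshMarginMs) {
                            repository.put(id, buildContent(id));
                        }
                    }
                }
            }, 0, refreshMarginMs);
        }

        private String buildContent(int id) {
            return "content for " + id; // the expensive generation goes here
        }

        /** Placeholder for whatever shared store (distributed cache) the servers read from. */
        public interface ContentRepository {
            long buildTime(int id);
            void put(int id, String content);
        }
    }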

If this is not possible, then you need some sort of manager to orchestrate the content building. This manager can also play the role of the load balancer. The manager can be bundled together with a ContentBuilder or be a node of its own.

I think that the ideas of the distributed cache and the JMS messaging are good ones.

kgiannakakis
+2  A: 

Honestly I would rethink your approach and I'll tell you why.

I've done a lot of work on distributed high-volume systems (financial transactions specifically), and if the volume is sufficiently high (I'll assume it is, or you wouldn't be contemplating a clustered solution; you can get an awful lot of power out of one off-the-shelf box these days), your solution will kill you with remote calls (i.e. calls for data held on another node).

I will speak about Tangosol/Oracle Coherence here because it's what I've got the most experience with, although Terracotta will support some or most of these features and is free.

In Coherence terms, what you have is a partitioned cache: if you have n nodes, each node holds 1/n of the total data. Typically you have at least one level of redundancy, and that redundancy is spread as evenly as possible, so each of the other n-1 nodes holds roughly 1/(n-1) of the backups of that node's data (e.g. with four nodes, each holds a quarter of the primary data and about a third of the backups of each other node's share).

The idea in such a solution is to make sure as many cache hits as possible are local (to the same cluster node). Also, with partitioned caches in particular, writes are relatively expensive (and get more expensive the more backup copies you keep of each cache entry), although write-behind caching can minimize this, while reads are fairly cheap, which is what your requirements call for.

So your proposed solution guarantees that every cache hit goes to a remote node, because the nodes that build and store the content are not the nodes that serve it.

Also consider that generating content is undoubtedly much more expensive than serving it, which I'll assume is why you came up with this design: it lets you have more content generators than servers. It's the more tiered approach, one I'd characterize as horizontal slicing.

You will achieve much better scalability if you can vertically slice your application. By that I mean that each node is responsible for storing, generating and serving a subset of all the content. This effectively eliminates internode communication (excluding backups) and lets you tune the solution by simply giving each node a different-sized subset of the content.

Ideally, whatever scheme you choose for partitioning your data should be reproducible by your Web server so it knows exactly which node to hit for the relevant data.
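
As a concrete (made-up) illustration of "reproducible by your Web server": both tiers can share a trivial deterministic mapping from content id to node, something like the following. The node list and hashing here are illustrative only; a real deployment would want consistent hashing so adding a node doesn't remap everything.

    import java.util.List;

    /**
     * A deterministic content-id -> node mapping that both the web tier and the
     * content nodes can compute independently, so a request goes straight to the
     * node that owns (stores, generates and serves) that piece of content.
     */
    public class ContentPartitioner {

        private final List<String> nodeAddresses; // e.g. ["content1:8080", "content2:8080"]

        public ContentPartitioner(List<String> nodeAddresses) {
            this.nodeAddresses = nodeAddresses;
        }

        /** Every caller with the same node list gets the same answer for the same id. */
        public String nodeFor(String contentId) {
            int bucket = Math.abs(contentId.hashCode() % nodeAddresses.size());
            return nodeAddresses.get(bucket);
        }
    }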

Now you might have other reasons for doing it the way you're proposing but I can only answer this in the context of available information.

I'll also point you to a summary of grid/cluster technologies for Java I wrote in response to another question.

cletus
A: 

It sounds like you need some form of distributed cache, distributed locking and messaging.

Terracotta gives you all three - a distributed cache, distributed locking and messaging, and your programming model is just Java (no JMS required).

I wrote a blog about how to ensure that a cache only ever populates its contents once and only once here: What is a memoizer and why you should care about it.
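The essence of that pattern (sketched here from memory, not the code from the post) is to cache a Future per key, so the first requester triggers the build and every concurrent requester for the same key waits on the same computation rather than duplicating it. In a single JVM this is plain java.util.concurrent; the point of the answer is that a clustering technology like Terracotta can then share that map across nodes.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.FutureTask;

    /**
     * Memoizing cache: each key is built at most once; concurrent requesters for
     * the same key block on the same FutureTask instead of duplicating the work.
     */
    public class Memoizer<K, V> {

        private final ConcurrentMap<K, Future<V>> cache = new ConcurrentHashMap<K, Future<V>>();

        public V get(final K key, final Callable<V> builder)
                throws InterruptedException, ExecutionException {
            Future<V> future = cache.get(key);
            if (future == null) {
                FutureTask<V> task = new FutureTask<V>(builder);
                future = cache.putIfAbsent(key, task); // only one thread wins the race
                if (future == null) {
                    future = task;
                    task.run(); // the winner actually builds the content
                }
            }
            return future.get(); // everyone else waits for the single build
        }
    }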

I am in agreement with Cletus: if you need high performance, you will need to consider partitioning. However, unlike most solutions, Terracotta will work just fine without partitioning until you need it, and when you do apply partitioning it will simply divvy up the work according to your partitioning algorithm.

Taylor Gautier
A: 

You may want to try Hazelcast. It is an open source, peer-to-peer, distributed/partitioned map and queue with eviction support. Import one single jar and you are good to go! Super simple.
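
For example, a single node that both stores content and works off rebuild requests might look roughly like this (class and queue names invented, Hazelcast 3.x-style API assumed; a real builder would re-check freshness before rebuilding):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.core.IQueue;

    /**
     * Minimal Hazelcast flavour of the same design: a partitioned map for built
     * content and a distributed queue for rebuild requests. Per-key locking keeps
     * two builders from building the same item at the same time.
     */
    public class HazelcastContentNode {

        public static void main(String[] args) throws InterruptedException {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(); // joins the cluster
            IMap<String, String> content = hz.getMap("content");
            IQueue<String> rebuildRequests = hz.getQueue("rebuild-requests");

            // Builder loop: take a requested id, lock that key, build, store.
            while (true) {
                String id = rebuildRequests.take();
                content.lock(id);
                try {
                    content.put(id, build(id));
                } finally {
                    content.unlock(id);
                }
            }
        }

        private static String build(String id) {
            return "content for " + id; // the expensive generation goes here
        }
    }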

Talip Ozturk