For the sake of simplicity, let's assume I'm cloning twitter (I'm not). So every user can follow other users, and be followed by other users. For each user you follow, you receive all tweets he sends. Everything is stored in a data storage (be it a NoSQL solution or a sharded relational database).

However, when users are online, do you think it is appropriate to have them receive tweets via JMS, rather than by polling the database for new tweets? The setup would be:

  • when a user registers (or when he logs in), a JMS Topic is created, named after him (or his id)
  • when a user logs in, he subscribes to the JMS Topic of each of the users he follows
  • a session-scoped object (per-user) acts as a JMS message-listener
  • all received messages are stored in the session (in-memory)
  • the UI is updated via ajax polling of the session-scoped object
  • when the user logs out, or his session times out, the message-listener is destroyed

The idea behind this is supposedly to boost performance - i.e. not to query the datastore too often, but instead to cache the most recent items in memory.
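To make that concrete, here is a minimal sketch of such a session-scoped listener, using the plain JMS 1.1 API (the ConnectionFactory injection, the "tweets.<userId>" topic naming and the way the AJAX poll reads the list are my own assumptions, not requirements):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageListener;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;

    // Sketch of the session-scoped, per-user listener described above.
    // Topic naming ("tweets.<userId>") is an assumed convention.
    public class TweetFeedListener implements MessageListener {

        private final List<String> receivedTweets = new CopyOnWriteArrayList<String>();
        private Connection connection;

        // called on login: subscribe to the topic of every followed user
        public void subscribe(ConnectionFactory factory, List<String> followedUserIds)
                throws JMSException {
            connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            for (String userId : followedUserIds) {
                Topic topic = session.createTopic("tweets." + userId);
                MessageConsumer consumer = session.createConsumer(topic);
                consumer.setMessageListener(this);
            }
            connection.start();
        }

        // JMS pushes new tweets here; they stay in memory for the AJAX poll
        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    receivedTweets.add(((TextMessage) message).getText());
                }
            } catch (JMSException e) {
                // log and drop in a real application
            }
        }

        // the AJAX poll reads whatever has accumulated in the session so far
        public List<String> getReceivedTweets() {
            return new ArrayList<String>(receivedTweets);
        }

        // called on logout or session timeout
        public void close() throws JMSException {
            if (connection != null) {
                connection.close();
            }
        }
    }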

The whole thing is of course expected to run in a cluster, and be scalable.

However I'm not sure:

  • whether this is actually worth it (in terms of performance and scalability gains)
  • whether JMS adds an overhead comparable to that of querying the datastore (which would make the whole complication useless)

At some point (when the thing is functional) I will make some benchmarks, but I'd like to hear some initial remarks.

+3  A: 

Sounds reasonable enough. You'd need to be sure that your JMS implementation of choice supports a potentially very large number of topics - not all of them can do that elegantly.

My main design concern would be that when a user first logs in, his session store of messages is empty, and you'd have to wait for it to fill up. Wouldn't you then have to hit the database anyway, or would this not be an issue?

Also, you're not really making use of the event-driven nature of JMS here. Messages received from the topic are just dumped into the session store for later retrieval.

Since it's not really event-driven, you could perhaps consider a distributed in-memory data-store instead, such as EhCache+JGroups, or JBossCache3 (which I can highly recommend). New tweets would be dropped into this distributed store, and readers would just need to trawl over that looking for items of interest. This could be more memory-efficient, since only one copy of each tweet would be stored on each node. You could also pre-load the cache at system start-up.
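For illustration, a rough sketch of that alternative using the Ehcache 2.x API (the "tweets" cache name, the senderId:timestamp key scheme and the JGroups/RMI replication are assumed to be configured in ehcache.xml - none of this is prescriptive):

    import java.util.ArrayList;
    import java.util.List;

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    // Sketch of the distributed-cache alternative: one copy of each tweet per
    // node, readers trawl the cache on demand. Cache name and key scheme are
    // assumptions for illustration.
    public class TweetCache {

        private final Cache tweets;

        public TweetCache(CacheManager cacheManager) {
            // "tweets" is assumed to be declared in ehcache.xml with a
            // replicator (e.g. JGroups) and a time-to-live for old entries
            this.tweets = cacheManager.getCache("tweets");
        }

        // writer side: drop the new tweet into the distributed store
        public void publish(String senderId, long timestamp, String text) {
            tweets.put(new Element(senderId + ":" + timestamp, text));
        }

        // reader side: scan the keys for tweets from followed users
        public List<String> tweetsFrom(List<String> followedUserIds) {
            List<String> result = new ArrayList<String>();
            for (Object key : tweets.getKeys()) {
                String k = (String) key;
                String senderId = k.substring(0, k.indexOf(':'));
                if (followedUserIds.contains(senderId)) {
                    Element element = tweets.get(k);
                    if (element != null) {
                        result.add((String) element.getObjectValue());
                    }
                }
            }
            return result;
        }
    }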

skaffman
yes, on login the database is hit once, but that's it for the whole session (unless of course you want to retrieve older 'tweets'). The distributed cache suggestion is very good - instead of polling the JMS provider, I'll be polling the cache (same CPU load), but each message will exist only once per node. And actually, there will be only one level of polling - the ajax one: each request will look in the cache directly (rather than ajax polling the pojo and the pojo polling JMS)
Bozho
@Bozho: It might be worth trying to decouple session activity from the database entirely, if you want it to scale well.
skaffman
+1  A: 

Note: I don't have practical experience with the system setup you described in your question, so the following are really theoretical considerations.

One factor is what your users will see when they log in the next time:

  1. All tweets that haven't been delivered to the user in his previous sessions. Or
  2. All tweets that have been posted in the previous x hours.

Case 1: JMS is nice here, because a queue can remember which messages have already been delivered. But wait: this means you'd have to have one queue per message receiver.

Case 2: Here you can really work with one topic per message sender, and expire messages that are older than x hours. So again, JMS may be a good option.
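A small sketch of case 2, assuming one topic per sender and an expiry window of x = 12 hours (the naming and the figure are only illustrative) - the producer's time-to-live lets the broker expire old tweets for you:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.Topic;

    // Sketch for case 2: one topic per sender, messages expire after x hours.
    public class TweetPublisher {

        private static final long EXPIRY_MILLIS = 12L * 60 * 60 * 1000; // x = 12 hours, assumed

        public void publish(ConnectionFactory factory, String senderId, String text)
                throws JMSException {
            Connection connection = factory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Topic topic = session.createTopic("tweets." + senderId); // assumed naming
                MessageProducer producer = session.createProducer(topic);
                producer.setTimeToLive(EXPIRY_MILLIS); // broker discards older messages
                producer.send(session.createTextMessage(text));
            } finally {
                connection.close();
            }
        }
    }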

Performance

JMS implementations can usually keep the messages in memory, and to access the messages in a queue/topic you don't have to search a large index - so I assume that this should be faster than a database, and probably still faster than an in-memory database. When a node fails, you can either re-create the queues from the backing database, or you can use the high availability and persistence built into some JMS implementations.

But I fully agree with skaffman: compared to a distributed in-memory data-store, you'll be using more memory. The advantage of JMS is that it simplifies automatic expiry of messages (and some other things), and I don't know if it's a good idea to re-implement that functionality.

So maybe what I would do is save just the IDs in the queues, and hold the actual messages in a Java object cache. This way you'll once again have to use an index, but you get the convenience of JMS with most of the memory efficiency of the object cache. If the cache is only on the sender's side, it doesn't even have to be distributed, assuming that all messages from one sender reside on one (replicated) node - but that probably depends on a lot of other architectural decisions.
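A tiny sketch of that idea (the Map here just stands in for whatever object cache is used; the id-as-text-message convention is assumed):

    import java.util.Map;

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // Sketch: the queue/topic carries only the tweet id, the body is resolved
    // from an object cache on the receiving node.
    public class TweetIdListener implements MessageListener {

        private final Map<String, String> tweetBodyCache; // id -> tweet text (stand-in for the cache)

        public TweetIdListener(Map<String, String> tweetBodyCache) {
            this.tweetBodyCache = tweetBodyCache;
        }

        public void onMessage(Message message) {
            try {
                String tweetId = ((TextMessage) message).getText(); // message carries only the id
                String body = tweetBodyCache.get(tweetId);          // indexed lookup in the cache
                if (body != null) {
                    // hand the tweet over to the session store / UI layer
                }
            } catch (JMSException e) {
                // log and drop in a real application
            }
        }
    }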

Chris Lercher