views:

24

answers:

1

Hi,

In our web application we need to track what users click, what they type into the search box, and so on. Lots of data will be sent via AJAX. The functionality is broadly similar to Google Analytics, but we need to customize it in various ways.

Data will be collected, then once per day aggregated and exported to PostgreSQL, so the backend should be able to handle dozens of inserts. I'm not considering a traditional SQL database, because it probably won't handle that many inserts efficiently.
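To give an idea, the collection endpoint itself would be trivial, roughly like this (the model and field names here are only illustrative):

    # app/controllers/tracking_controller.rb -- rough sketch, names are made up
    class TrackingController < ApplicationController
      skip_before_action :verify_authenticity_token  # fired as an AJAX beacon

      # POST /track with e.g. { kind: "click", target: "#search-box", query: "shoes" }
      def create
        TrackedEvent.create!(
          user_id:     current_user&.id,
          kind:        params[:kind],
          target:      params[:target],
          query:       params[:query],
          occurred_at: Time.now.utc
        )
        head :no_content
      end
    end

The open question is what TrackedEvent should be backed by.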

I wonder which backend you would use for such a task? I'm currently thinking about MongoDB or Cassandra, but maybe you know better software for the job? Or maybe something other than a NoSQL database?

The web application is written in Ruby on Rails, so Ruby support would be nice, but that's definitely not the most important thing.

+1  A: 

Sounds like you need to analyse your specific requirements.

It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.

Depending on what your tolerance for data loss is, there are a lot of options. If you choose a system with single-server durability, a major source of write bottlenecks will be fdatasync() (assuming you use hard drives to store your data).

If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.

Depending on your table and index structure etc., I'd expect that you can get rather a lot of inserts out of a "conventional" DB (e.g. PostgreSQL), if you manage it correctly and tune the durability (if it supports that) to your liking.
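For example, with PostgreSQL you can trade durability for write throughput by turning off synchronous commits (per session or in postgresql.conf) and batching rows into a single INSERT. A rough sketch with the Ruby pg gem -- the events table and its columns are just placeholders:

    require "pg"

    conn = PG.connect(dbname: "tracking")  # adjust connection details for your setup

    # Commits return before the WAL is flushed to disk; on a crash you can lose
    # the last fraction of a second of events, but committed data stays consistent.
    conn.exec("SET synchronous_commit TO off")

    # One multi-row INSERT instead of one round trip per event.
    events = [["click", "#search-box", "shoes"], ["click", "#buy-button", nil]]
    placeholders = events.each_with_index
                         .map { |_, i| "($#{i * 3 + 1}, $#{i * 3 + 2}, $#{i * 3 + 3})" }
                         .join(", ")
    conn.exec_params("INSERT INTO events (kind, target, query) VALUES #{placeholders}",
                     events.flatten)

If even that isn't enough, COPY is faster still for bulk loading.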

Sharding this across several instances will of course let you scale it up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.
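On the application side the sharding itself can stay very simple -- hash a key (e.g. the user id) to pick a connection. Only a sketch; connection details and table layout are placeholders:

    require "pg"
    require "zlib"

    # One connection per shard; in practice these point at different servers.
    SHARDS = [
      PG.connect(dbname: "tracking_shard0"),
      PG.connect(dbname: "tracking_shard1"),
    ]

    def shard_for(user_id)
      SHARDS[Zlib.crc32(user_id.to_s) % SHARDS.size]
    end

    def record_event(user_id, kind, target)
      shard_for(user_id).exec_params(
        "INSERT INTO events (user_id, kind, target) VALUES ($1, $2, $3)",
        [user_id, kind, target]
      )
    end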

MarkR
Thanks for the answer. For this task I need performance over durability. However, I'm afraid that even with a high sync interval an RDBMS will still spend extra time handling transactions, constraints and so on. Here we'll have only one table with 4 columns, so I would like to get rid of ACID. Anyway, I'll compare that solution to the others.
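If we end up on MongoDB, I imagine the write path would look roughly like this with the Ruby driver (collection and field names are just an example):

    require "mongo"

    client = Mongo::Client.new(["127.0.0.1:27017"], database: "tracking")
    events = client[:events]

    # Buffer events in memory and flush them in batches; one insert_many is much
    # cheaper than a round trip per event, and there are no constraints or
    # transaction overhead to pay for.
    batch = [
      { kind: "click", target: "#search-box", query: "shoes", at: Time.now.utc },
      { kind: "click", target: "#buy-button", query: nil,     at: Time.now.utc },
    ]
    events.insert_many(batch)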
mlomnicki
You should split the operations: a receiving service (a dedicated endpoint) sends the data to a storing service; that service stores the data in the RDBMS and signals another service that does the processing. As the pipeline you can use MSMQ or some other durable solution.
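Roughly this shape -- here with a plain in-process Queue just to illustrate; in production the pipeline would be MSMQ, RabbitMQ or another durable broker:

    queue = Queue.new

    # Receiving side: only enqueue the event, so the HTTP request returns at once.
    def receive(queue, event)
      queue << event
    end

    # Storing side: a separate worker drains the queue, writes batches to the
    # RDBMS and then signals the processing/aggregation step.
    Thread.new do
      loop do
        batch = [queue.pop]                                   # block until something arrives
        batch << queue.pop until queue.empty? || batch.size >= 100
        # store_batch(batch)       # INSERT into the database here
        # notify_processor(batch)  # hand off for aggregation
        puts "stored #{batch.size} events"
      end
    end

    receive(queue, { kind: "click", target: "#search-box" })
    sleep 0.1  # toy example only: give the worker a moment before exiting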
dario-g