I am working on a project that involves GPS data collection from many users (say 1000) every second while they move. I am planning on using a dedicated database instance on EC2 running MySQL with persistent block storage (EBS), fronted by a Ruby on Rails application behind nginx. I haven't worked on such a data collection application before. Am I missing something here?

I will have another instance that will act as the application server and use the data from the same EBS volume. If anybody has dealt with such a system before, any advice would be much appreciated.

A: 

You should use PostgreSQL for this. Postgres has better support for spatial data types (point, line, polygon, etc.). It also has functions for handling and calculating with different spatial data types, as well as indexing for such data. You may want to use the GeoKit gem for Ruby on Rails for various operations at the ActiveRecord level.
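For example, a rough sketch of what this could look like at the ActiveRecord level with geokit-rails (the Position model, its lat/lng columns, and the sample coordinates are made up for the example, not anything from your schema):

    # Hypothetical model; assumes the geokit-rails plugin and a positions
    # table with lat and lng float columns.
    class Position < ActiveRecord::Base
      acts_as_mappable :default_units => :kms,
                       :lat_column_name => :lat,
                       :lng_column_name => :lng
    end

    # Find every position reported within 2 km of a point of interest.
    origin = Geokit::LatLng.new(37.7749, -122.4194)
    nearby = Position.find(:all, :origin => origin, :within => 2)

    # When :origin is given, each record comes back with a distance attribute.
    nearby.each { |p| puts "#{p.id}: #{p.distance} km away" }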

And I agree with webdestroya - every second?

Eimantas
I am actually not using any fancy spatial features. I will use this data for some computations and to display routes using Google Maps. Besides, we have been using MySQL in the past, so legacy is at play here too.
gvaswani
+1  A: 

I would be most worried about MySQL and the disk being your bottleneck. I'm going to assume you're already familiar with the Ruby/Rails trade-off of always needing to throw more hardware at the application layer in return for higher programmer productivity. However, you're going to need to scale MySQL for writes, and that can be a tricky proposition if you're actually talking about more than 1000 QPS (1000 users, writing once a second).

I would recommend taking whatever configuration of MySQL you're planning on using and throwing a serious amount of write traffic at it. If it falls over at anything under, say, 3000 QPS (always give yourself breathing room for spikes), you're going to need to either revise your plan (data every second? really?) or write to something like memcache first and use scheduled tasks to write to the database in one go (MySQL 3.22.5 and later supports multiple inserts in a single query, and there's also the LOAD DATA INFILE method, which can be used in conjunction with /dev/shm). You can also look into delayed insertion if you're not using InnoDB.
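To make the "buffer first, write in one go" idea concrete, here is a rough sketch (the gps_points table and its columns are hypothetical, and a plain in-memory array stands in for memcache; in practice the flush would run from a scheduled task):

    # Accumulate incoming points and flush them with a single multi-row
    # INSERT instead of a thousand separate single-row writes per second.
    class GpsPointBuffer
      FLUSH_SIZE = 1000

      def initialize
        @buffer = []
      end

      def add(user_id, lat, lng, recorded_at)
        @buffer << [user_id, lat, lng, recorded_at]
        flush if @buffer.size >= FLUSH_SIZE
      end

      def flush
        return if @buffer.empty?
        conn = ActiveRecord::Base.connection
        values = @buffer.map do |user_id, lat, lng, at|
          "(#{conn.quote(user_id)}, #{conn.quote(lat)}, #{conn.quote(lng)}, #{conn.quote(at)})"
        end
        conn.execute(
          "INSERT INTO gps_points (user_id, lat, lng, recorded_at) " +
          "VALUES #{values.join(', ')}"
        )
        @buffer.clear
      end
    end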

I'm biased of course (I work for Google), but I would be using App Engine for this. We run stuff that gets way more write traffic than this all the time on App Engine and it works great. It scales out of the box, there's no need to start up new images, and you don't have to deal with the issues of scaling SQL-based persistence. Also you get a ton of free quota to work with before billing starts. You can run JRuby if you really want a Ruby environment, or you can opt for Python, which is a bit better supported. Deployment is also much easier for something like this, even if you're using Vlad or Capistrano with EC2.

Edit: Here's a very conservative estimate of your data growth. 16 bytes is just the minimum required to store a lat/lon coordinate pair (two doubles). In the real world you have indexes and other database overhead that will increase this number. Adjust the formula accordingly based on real data to figure out how quickly you'll hit the 150GB limits.
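A back-of-the-envelope version of that formula, using only the numbers above (1000 users, one point per second, 16 bytes per point) and ignoring all index and row overhead:

    users           = 1000
    bytes_per_point = 16            # two 8-byte doubles (lat, lon)
    points_per_day  = 24 * 60 * 60  # one point per second
    bytes_per_day   = users * bytes_per_point * points_per_day
    # => 1,382,400,000 bytes, roughly 1.4 GB of raw coordinates per day

    gigabytes_per_day = bytes_per_day / 1_000_000_000.0
    days_to_150_gb    = 150 / gigabytes_per_day
    # => about 108 days before raw coordinate storage alone reaches 150 GB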

Bob Aman
Thanks Bob for the response. One thing we are considering is saving the data locally on the smartphone and sending it in batches (every 2 or 5 mins). Does that alter any of the scenarios you described? I have used Google App Engine before and would have definitely used it if we were not already so far into the project. Besides, we have more Rails expertise than Python/Django.
gvaswani
Well, in truth, I would probably not be using Django for handling this kind of throughput. I'd just write a request handler and process everything bare-metal. If you're going to go with Rails, you should do the equivalent and handle all of these incoming requests with the Rails Metal Rack middleware. The rest of your app you should be able to do with normal Rails request logic.
Bob Aman
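For what it's worth, a minimal sketch of the kind of Metal handler Bob is describing, assuming Rails 2.3's app/metal layout (the /locations path, class name, and points parameter are made up):

    # app/metal/location_intake.rb
    # Accepts batched GPS posts on /locations and hands every other request
    # back to the regular Rails stack by returning 404.
    require(File.dirname(__FILE__) + "/../../config/environment") unless defined?(Rails)

    class LocationIntake
      def self.call(env)
        if env["PATH_INFO"] =~ /^\/locations/
          points = Rack::Request.new(env).params["points"]
          # Parsing and persisting the batch (e.g. via a buffered multi-row
          # insert) would happen here.
          [200, { "Content-Type" => "text/plain" }, ["ok"]]
        else
          [404, { "Content-Type" => "text/html" }, ["Not Found"]]
        end
      end
    end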
Yes, sending the data in batches would be a much better approach. Even once a minute would be a vast improvement.
Bob Aman