background:

I have 'inherited' a PHP webapp at my small company and, after years of nagging, have finally gotten the go-ahead to throw away the spaghetti code and start again.

We want to log every action that is made in the system for example:

  • user X viewed item Y
  • user X updated item Y
  • new item Y on city Z

and later provide graphs, at different resolutions (day, month, year), of the actions done in the system.

In the previous version we have a table with 20,000,000 records going back to 2005, which should give you an idea of the amount of data we already have, and that is only one of many statistics.

actual question:

What recommendations do you have for building a near-real-time system to generate these statistics?

notes:

  1. Graphing is already covered via Google's Visualization API
  2. I'm not averse to using NoSQL databases, messaging servers, crons or whatever gets the job done, but would prefer a MySQL/PHP solution
  3. My current train of thought is to automatically create a table for each statistic I want to save, and create several aggregation tables (by day, by month, by year) to cache the results; a rough sketch of that idea follows this list
  4. I know this is a broad question but any suggestions are welcome
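
Something along these lines is what I have in mind for the daily cache; the table and column names below are placeholders, and a cron would run the rollup once a day:

<?php
// Hypothetical daily rollup cron: aggregate yesterday's raw rows into a cache table.
// Assumes a raw `item_views` table (item_id, created_at) and an `item_views_daily`
// table whose primary key is `day` (none of these names are final).
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass'); // placeholder credentials

$pdo->exec("
    INSERT INTO item_views_daily (`day`, views)
    SELECT DATE(created_at), COUNT(*)
    FROM item_views
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
      AND created_at <  CURDATE()
    GROUP BY DATE(created_at)
    ON DUPLICATE KEY UPDATE views = VALUES(views)
");
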
+1  A: 

If all users must be registered I would go with a completely normalized solution.

USERS TABLE            OBJECTS TABLE
---------------        -----------------     
user_id (primary)      object_id (primary)


USERS_TO_OBJECTS TABLE
--------------------
user_id (index)
object_id (index)
time (index)
action (index)
? object_type (index) // could be useful to speed things up

This setup will probably give you the best flexibility while charting and will also be pretty fast as you can leave out the user or the object if you don't need them.
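
If it helps, one way to spell that table out as DDL; the column types and engine are my guess, the schema above only fixes the columns and indexes:

<?php
// One possible definition of the USERS_TO_OBJECTS table; adapt the types to your data.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass'); // placeholder credentials

$pdo->exec("
    CREATE TABLE users_to_objects (
        user_id     INT UNSIGNED NOT NULL,
        object_id   INT UNSIGNED NOT NULL,
        `time`      DATETIME     NOT NULL,
        action      VARCHAR(20)  NOT NULL,
        object_type VARCHAR(20)  NOT NULL,
        KEY idx_user   (user_id),
        KEY idx_object (object_id),
        KEY idx_time   (`time`),
        KEY idx_action (action),
        KEY idx_type   (object_type)
    ) ENGINE=InnoDB
");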

Edit:

Say city Z (id 9876) was updated by user X (id 1234)...

1234    - user_id (the user that did the action)
9876    - object_id (the object where the action was done)
xyz     - time
updated - action type (so that you select only specific actions)
city    - object type (so that you select only specific objects)

I've filled this table with 40M rows and the results are pretty acceptable.

0.002 secs for a simple COUNT on the number of UPDATED cities in the last WEEK. Data was inserted randomly.
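
The query behind that benchmark would look something like this from PHP; the action and type values are just the ones from the example above:

<?php
// Count how many cities were updated in the last week.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass'); // placeholder credentials

$stmt = $pdo->prepare("
    SELECT COUNT(*)
    FROM users_to_objects
    WHERE action      = :action
      AND object_type = :type
      AND `time`      >= NOW() - INTERVAL 1 WEEK
");
$stmt->execute(array(':action' => 'updated', ':type' => 'city'));
echo $stmt->fetchColumn(); // number of city updates in the last 7 days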

Edit 2

If you find yourself with a really huge table you can resort to MySQL partitions, and your schema is well suited to them. I really don't know how you're going to use the tables, but you could:

PARTITION BY RANGE. Organize the partitions by date. Every new month or so you'd have a new partition.

PARTITION BY KEY. Organize the partitions by action. Every action goes to its proper partition.

You can check out more on partitions on MySQL's site, and this article gives you some detail on fine-tuning partitions.
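
As a rough illustration of the RANGE option, assuming the users_to_objects table sketched above (it has no primary key, so any column can be the partitioning key), monthly partitions on the time column could look like this:

<?php
// Hypothetical monthly RANGE partitioning on `time`; partition names and dates are examples only.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass'); // placeholder credentials

$pdo->exec("
    ALTER TABLE users_to_objects
    PARTITION BY RANGE (TO_DAYS(`time`)) (
        PARTITION p2010_10 VALUES LESS THAN (TO_DAYS('2010-11-01')),
        PARTITION p2010_11 VALUES LESS THAN (TO_DAYS('2010-12-01')),
        PARTITION p_future VALUES LESS THAN MAXVALUE
    )
");
// A monthly cron could REORGANIZE the p_future partition to split off the next month.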

Frankie
Thanks @Frankie. I'm expecting a lot of rows; storing all statistics in the same table would make it huge and slow. Plus some statistics carry extra information, for example "new item Y on city Z", where it's not always a user_id.
Mon Villalon
@Mon Villalon, I've run some tests and updated the answer accordingly. If all queries hit good indexes, 40M rows should be no problem at all. You may, nevertheless, resort to partitions if you find the database struggling.
Frankie
A: 

You can consider a "database" like Redis. Redis uses linked lists to store list types, so you could easily just LPUSH to a list like so:

$action = array(
    "type"=>"page_view",
    "url"=>"/example/path",
    "user"=>"user1",
    "timestamp"=>$_SERVER["REQUEST_TIME"]
);
$r = new Redis();
$r->connect('127.0.0.1', 6379); // default host/port; adjust to your Redis setup
$r->lPush("global_history", json_encode($action));

The LPUSH operation runs in O(1), so that shouldn't be an issue. Retrieving results might be a little trickier (O(n) time for lists). Redis is also largely memory-based (though dev builds include "virtual memory" support), so it's fast, but you could run out of room rather easily.
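
Reading recent entries back out is just an lRange over the head of the list (LPUSH puts the newest entry first), something like:

<?php
// Fetch the 100 most recent actions and decode them.
$r = new Redis();
$r->connect('127.0.0.1', 6379); // adjust host/port to your setup

$recent = array();
foreach ($r->lRange('global_history', 0, 99) as $json) {
    $recent[] = json_decode($json, true);
}
// $recent now holds up to 100 action arrays, newest first.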

I use this technique to record history and statistics for a music site I run. It's very successful and provides very fast results with very little effort.

If you want to make it more solid, you could use Hadoop or another technology to pull from the end of the list (RPOP) and archive the results into a more permanent format (XML pushed to Amazon S3, for instance). You could then have your Google Visualizations pull their data from the archive (which would be static, compiled data) when viewing older data, and from the Redis front-end for more recent data.
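
A stripped-down version of that archiving step, popping from the tail of the list and appending to a local file in place of the S3/XML archive described above:

<?php
// Drain old entries from the tail of the list into a flat archive file.
// The file path and the 100,000-entry "hot window" are arbitrary example values.
$r = new Redis();
$r->connect('127.0.0.1', 6379); // adjust host/port to your setup

$archive = fopen('/var/log/stats/history-' . date('Y-m-d') . '.log', 'a');
while ($r->lLen('global_history') > 100000) {
    $entry = $r->rPop('global_history'); // oldest entry
    if ($entry === false) {
        break;
    }
    fwrite($archive, $entry . "\n");
}
fclose($archive);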

Hope this helps!

mattbasta