views:

54

answers:

2

Hi, I'm planning to log my Squid instances to MongoDB, but the problem is that we have heavy traffic to log, and every access is authenticated with a username/password. Eventually we have to produce reports based on the logs. I was thinking of storing the logs bucketed by month and by user, so a document in my collection would look like this:

{month: 'april', users: [{user: 'loop0', logs: [{timestamp: 12345678.9, url: 'http://stackoverflow.com/question/ask', ... }]}]}

So if I want to generate my reports for April, I just have to fetch the right month's document instead of scanning zillions of lines for those whose timestamp falls between April 1 and April 30.
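
To illustrate, this is roughly what the two read paths would look like in pymongo (the database name and the collection names monthly_logs and raw_logs are just placeholders I made up):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient().squid  # placeholder database name

    # Bucketed schema: the whole month comes back as a single document.
    april = db.monthly_logs.find_one({"month": "april"})

    # Flat schema: a range scan over individual log lines instead.
    start = datetime(2012, 4, 1, tzinfo=timezone.utc).timestamp()
    end = datetime(2012, 5, 1, tzinfo=timezone.utc).timestamp()
    lines = db.raw_logs.find({"timestamp": {"$gte": start, "$lt": end}})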

Of course this kind of insert will be slower than just inserting the log line directly, as sketched below. So my question is: is there a better way to do this?
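
To make the write cost concrete, here is a sketch of what each approach would involve per log line (same placeholder names as above; the nested push needs the positional $ operator, plus a fallback upsert for when the user subdocument doesn't exist yet):

    from pymongo import MongoClient

    db = MongoClient().squid  # placeholder database name
    entry = {"timestamp": 12345678.9, "url": "http://stackoverflow.com/question/ask"}

    # Flat schema: one cheap append per log line.
    db.raw_logs.insert_one(dict(entry, user="loop0"))

    # Bucketed schema: try to push into the existing user's array first...
    result = db.monthly_logs.update_one(
        {"month": "april", "users.user": "loop0"},
        {"$push": {"users.$.logs": entry}},
    )
    # ...then fall back to creating the user entry (or the month document).
    if result.matched_count == 0:
        db.monthly_logs.update_one(
            {"month": "april"},
            {"$push": {"users": {"user": "loop0", "logs": [entry]}}},
            upsert=True,
        )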

Nowadays we log around 12 million lines per day.

A: 

It's hard to tell without knowing the details, but I'd say it's likely you're worrying about the wrong problem: you're thinking about insert speed rather than report calculation speed.

Mongo has all day to store those 12 million entries, but you may want the report, spanning maybe half a billion entries (roughly one month's worth of data), to render in near real time (seconds, maybe up to a minute). From that perspective, it's probably advisable to optimize for reading rather than for writing.
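
For instance (just a sketch, assuming a flat collection of raw log lines and a recent enough MongoDB/pymongo for the aggregation framework), you can keep writes dead simple and let an index carry the monthly report:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient().squid  # placeholder database name

    # One cheap insert per line; the compound index serves the report query.
    db.raw_logs.create_index([("timestamp", 1), ("user", 1)])

    start = datetime(2012, 4, 1, tzinfo=timezone.utc).timestamp()
    end = datetime(2012, 5, 1, tzinfo=timezone.utc).timestamp()

    # Monthly per-user report: filter April, then count hits per user.
    report = db.raw_logs.aggregate([
        {"$match": {"timestamp": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$user", "hits": {"$sum": 1}}},
    ])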

Tomislav Nakic-Alfirevic
A: 

You could also create a new collection every month. Or store the data twice. Disk space is cheap.
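
For example (just a sketch; the naming scheme is made up):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient().squid  # placeholder database name

    def monthly_collection(ts):
        # Route each line to a collection named after its month, e.g. logs_2012_04.
        d = datetime.fromtimestamp(ts, timezone.utc)
        return db["logs_%d_%02d" % (d.year, d.month)]

    monthly_collection(12345678.9).insert_one(
        {"timestamp": 12345678.9, "user": "loop0",
         "url": "http://stackoverflow.com/question/ask"}
    )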

Theo