
Hello,

I'm thinking about trying MongoDB for storing our stats, but I have some general questions about whether I'm understanding it correctly before I actually start learning it.

I understand the concept of using documents; what I'm not too clear about is how much data can be stored inside each document. The following diagram shows the layout I'm thinking of:

Website (document)
 - some keys/values about the particular document
 - statistics (tree)
   - millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)

What got me excited about MongoDB was its grouping functions, such as the ones described at http://www.mongodb.org/display/DOCS/Aggregation:

db.test.group(
  { cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
  , key: {http_action: true}
  , initial: {count: 0, total_time: 0}
  , reduce: function(doc, out){ out.count++; out.total_time += doc.response_time; }
  , finalize: function(out){ out.avg_time = out.total_time / out.count; }
  });

But my main concern is how hard a command like that would be on the server if there are, say, tens of millions of records across dozens of documents, on a 512 MB to 1 GB RAM server on Rackspace for example. Would it still keep the load low?

Is there any limit to the number of documents MongoDB can have (in separate databases)? Also, is there any limit to the number of records in a tree like the one I described above? And does the query I showed above run instantly, or is it some sort of map/reduce query? I'm not sure whether I could execute it on page load in our control panel to get those stats instantly.

Thanks!

+3  A: 

Every document has a size limit of 4MB (which in text is A LOT).

It's recommended to run MongoDB with replication, as you will otherwise have problems with single-server durability. Single-server durability is not given because MongoDB only fsyncs to the disk every 60 seconds by default, so if your server goes down between two fsyncs, the data that got inserted/updated in that time will be lost.

There is no limit on the number of documents in MongoDB other than your disk space.

You should try importing a dataset that matches your data (or generating some test data) into MongoDB and analyse how fast your query executes. Remember to set indexes on the fields that you use heavily in your queries. Your query above should work pretty well even with a lot of data.
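
For example, a minimal sketch for the query above in the mongo shell (using the db.test collection from your example):

db.test.ensureIndex({"invoked_at.d": 1}); // index the field used in the cond filter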

To analyze the speed of your query, use the database profiler that ships with MongoDB. In the mongo shell do:

db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results

Remember to turn off profiling once you're finished (the profile log will get pretty huge otherwise):
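
db.setProfilingLevel(0); // disable the profiler again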

Regarding your database layout, I suggest changing the "schema" (yeah yeah, schemaless...) to:

website (collection)
 - some keys/values about the particular website

statistics (collection)
 - millions of rows, where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc.), plus a DBRef to the website document

See Database References
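
A minimal sketch of that layout in the mongo shell (the field names here are just assumptions for illustration):

var site = {domain: "example.com"};
db.website.insert(site); // the shell fills in site._id on insert

db.statistics.insert({
    website: new DBRef("website", site._id), // reference back to the website document
    ts: new Date(),
    ip: "127.0.0.1",
    browser: "Firefox"
});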

halfdan
this is great, thanks! If I use a collection for the statistics, is there still a 4MB limit? I'm sure it may be possible to use that group command on multiple collections, but for simplicity's sake I'd rather have all the raw records stored inside one "table".
Joe
The 4MB limit is per document; the collection itself can contain as many documents as your disk can hold. Your statistics will grow rapidly, and if they are stored inside a single document you'll probably reach the 4MB limit very soon.
halfdan
"the collection itself can contain as many documents as your disk can hold." With sharding, you can even go beyond that :-)
Thilo
thanks a lot! going to try this out.
Joe
"eventual consistency" doesn't mean that, mongo is not eventually consistent like cassandra or simpledb, it's strongly consistent like a rdbms. there is no transaction log in mongo, so it can loose data on a power failture if there is no replication, this is called loosing data as is. "eventual consistency" means, you may get the old value of a record after update from some nodes in some conditions for a short time period.
sirmak
@sirmak: +1. Good catch. What halfdan is talking about is called "single-server durability", which is a target for the next release.
Thilo
@sirmak: You're right! Thanks for the catch, gonna update the post.
halfdan
+2  A: 

Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.

Basically, the number of page views a page can generate is unbounded, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of its parent document.
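
For instance, a minimal sketch of that parent-reference layout in the mongo shell (the collection and field names are illustrative assumptions):

var site = db.websites.findOne({domain: "example.com"});
db.pageviews.insert({
    website_id: site._id, // _id of the parent website document
    ts: new Date(),
    ip: "127.0.0.1",
    browser: "Firefox"
});
db.pageviews.find({website_id: site._id}); // all page views for that website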

The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.

The group() function is a map/reduce query under the hood. The documentation recommends using a map/reduce query instead of group(), because group() has some limitations with large datasets and in sharded environments.
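
For illustration, here's roughly what the group() query from the question could look like as an explicit map/reduce in the mongo shell (the collection and field names are taken from that example; without an output option, older servers write the results to a temporary collection):

var map = function() {
    // group by http_action, carrying the response time along
    emit(this.http_action, {count: 1, total_time: this.response_time});
};
var reduce = function(key, values) {
    var out = {count: 0, total_time: 0};
    values.forEach(function(v) {
        out.count += v.count;
        out.total_time += v.total_time;
    });
    return out;
};
var finalize = function(key, out) {
    out.avg_time = out.total_time / out.count;
    return out;
};

var res = db.test.mapReduce(map, reduce, {
    query: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}},
    finalize: finalize
});
db.getCollection(res.result).find(); // read the results back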

Niels van der Rest
+1, and map/reduce has some hard limitations
sirmak