views: 343

answers: 5

Hi,

I have a couple of SQLite DBs (I'd say about 15 GB), with about 1M rows in total - so not super big. I was looking at MongoDB, and it looks pretty easy to work with, especially if I want to try some basic natural language processing on the documents that make up the databases.

I've never worked with Mongo in the past, so I would have to learn from scratch (I'll be working in Python). After Googling around a bit, I came across a number of somewhat horrific stories about MongoDB's reliability. Is this still a major problem? In a crunch, I will of course retain the SQLite backups, but I'd rather not have to reconstruct my Mongo databases constantly.

Just wondering what sort of data corruption issues people have actually faced recently with Mongo? Is this a big concern?

Thanks!

+2  A: 

Mongo does not have ACID properties - specifically, it lacks durability. So you can face issues if the process does not shut down cleanly or the machine loses power. You are supposed to implement backups and redundancy to handle that.
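For example, one simple way to cover the backup side is a periodic dump. A minimal sketch in Python, assuming mongodump is on the PATH and a hypothetical database name "mydb":

    # Minimal sketch of a scheduled backup: shells out to mongodump,
    # writing a timestamped dump directory. Run it from cron or any scheduler.
    import datetime
    import subprocess

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    subprocess.run(
        ["mongodump", "--db", "mydb", "--out", f"/backups/mydb-{stamp}"],
        check=True,  # raise if mongodump exits with an error
    )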

Amala
+3  A: 

"MongoDB supports an automated sharding architecture, enabling horizontal scaling across multiple nodes." -source So you need to run multiple nodes for balancing and failover support. If you are wanting to run a single instance that won't fail if power is suddenly lost you need something that supports ACID like couchDB. That being said i've been using mongo at work for a month and it has not crashed on me however we are moving to a 6 node cluster soon.

Durability

The products take different approaches to durability. CouchDB is a "crash-only" design where the db can terminate at any time and remain consistent. MongoDB takes a different approach to durability. On a machine crash, one would then run a repairDatabase() operation when starting up again (similar to MyISAM). MongoDB recommends using replication -- either LAN or WAN -- for true durability as a given server could permanently be dead. To summarize: CouchDB is better at durability when using a single server with no replication.

Quote from mongodb.org's official site.
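For reference, the repairDatabase() step mentioned in the quote can also be triggered from Python. A minimal sketch, assuming a mongod on localhost old enough to still support the repairDatabase command (newer servers drop it in favour of running mongod --repair before starting up normally) and a hypothetical database name:

    # Minimal sketch: sends {repairDatabase: 1} to the server. This can take
    # a long time on large databases and blocks other operations while it runs.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["mydb"]          # hypothetical database name
    db.command("repairDatabase")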

gradbot
+5  A: 

Yes, durability is a big problem in Mongo. You have to use replica sets in MongoDB for durability (you need at least two machines); otherwise you can lose up to the last minute of writes on a power failure, for example. There is no single-server durability in Mongo, though as far as I know it is planned for the 1.7-1.8 cycle. After a crash you have to repair the database manually, and the repair operation may take hours if your data is large. There are no transactions or ACID guarantees, so it's not suitable for an e-commerce or banking application.

You should not use development versions of Mongo (odd version numbers like 1.3.x, 1.5.x, 1.7.x are development versions), and you should prefer 64-bit operating systems. If you dig into the disaster articles on the web about Mongo, in most cases the problem comes down to one of these two things.

CouchDB, Cassandra and PostgreSQL all have strong durability (fsync is 10 milliseconds by default in Cassandra and PostgreSQL), so they all have single-server durability.

If you need dead-easy scalability, fault tolerance and load balancing, Cassandra is the best, though with poor query options. Failed nodes can go away and come back after a period of time without a problem; the system automatically repairs itself.

Regards

sirmak
Interesting - 64-bit and the dev versions do seem to be coming up pretty often in the posts I have been looking at. I think Mongo has made me curious enough that I'm going to try it, but I'll make sure I have an ACID-compliant DB backup option too.
flyingcrab
Mongo is the most popular NoSQL solution today (one reason being the agile marketing from 10gen). If you dig into the references on their site, it is mostly used for low-value, high-performance data like analytics (e.g. reporting, error logging, page view counters, internal counters, etc.). There are also some sites (not that many) using it for all of their data.
sirmak
There are a lot of sites using it for real data these days. http://www.mongodb.org/display/DOCS/Production+Deployments lists a lot of high-profile deployments; a few of the sites using it for "real data" include SourceForge, Foursquare, Wordnik, and Business Insider.
Chris Heald
@Chris: Foursquare also uses PostgreSQL. Business Insider and 10gen are both owned by AlleyCorp. And Wordnik is a dictionary.
sirmak
+1  A: 

I don't see the problem if you also have the same data in the SQLite backups. You can always refill your MongoDB databases, and refilling will only take a few minutes.
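To make that concrete, here is a minimal refill sketch, assuming a hypothetical SQLite table documents(id, body) and hypothetical database/collection names; adjust it to your actual schema:

    # Minimal sketch of "refilling" Mongo from a SQLite backup.
    import sqlite3
    from pymongo import MongoClient

    conn = sqlite3.connect("backup.db")        # hypothetical SQLite backup file
    rows = conn.execute("SELECT id, body FROM documents")

    coll = MongoClient()["mydb"]["documents"]  # hypothetical Mongo names
    coll.delete_many({})                       # start from an empty collection

    batch = []
    for doc_id, body in rows:
        batch.append({"_id": doc_id, "body": body})
        if len(batch) == 1000:                 # insert in batches for speed
            coll.insert_many(batch)
            batch = []
    if batch:
        coll.insert_many(batch)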

TTT
+1. And you will not have to do this "constantly", only after a server power outage (or something else that causes mongod to crash within a minute of the last update operation). In your case you don't have updates at all, do you?
Thilo
+2  A: 

As others have said, MongoDB does not have single-server durability right now. Fortunately, it's dead easy to set up multi-node replication. You can even set up a second machine in another data center and have data automatically replicated to it live!

If a write must succeed, you can tell Mongo not to return from an insert/update until that data has been replicated to n slaves. This ensures that you have at least n copies of the data. Replica sets allow you to add and remove nodes from your cluster on the fly without any significant work: add a new node and it will automatically sync a copy of the data; remove a node and the cluster rebalances itself.

MongoDB is very much designed to be used across multiple machines, with multiple nodes acting in parallel; that is its preferred setup, compared to something like MySQL, which expects one giant machine to do its work and slaves paired against it when you need to scale out. It's a different approach to data storage and scaling, but a very comfortable one if you take the time to understand its different assumptions and how to build an architecture that capitalizes on its strengths.
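For what it's worth, here is roughly what that looks like from Python with a recent pymongo driver (older drivers used safe=True / w=... keyword arguments on insert instead); the database/collection names and the w=2 value are illustrative:

    # Minimal sketch: w=2 means the write is acknowledged by the primary
    # plus one replica before insert_one() returns.
    from pymongo import MongoClient, WriteConcern

    client = MongoClient("localhost", 27017)
    coll = client["mydb"].get_collection(
        "documents",
        write_concern=WriteConcern(w=2, wtimeout=5000),  # wait up to 5s for replication
    )

    # Raises an error if two members have not acknowledged the write
    # within the timeout.
    coll.insert_one({"body": "example document"})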

Chris Heald
+1 for the detailed insight - seems like a different paradigm, very exciting!
flyingcrab