I've been looking at MongoDB and I'm fascinated. It appears (although I have to be suspicious) that in exchange for organizing my database in a slightly different way, I get as much performance as I have CPUs and RAM for free? It seems elegant, and flexible, but I'm not trading that for fast like I am with Rails. So what's the catch? What does a relational database give me that I can't do as well or at all with Mongo? In other words, why (other than immaturity of existing NoSQL systems and resistance to change) doesn't the entire industry jump ship from MySQL?

As I understood it, as you scale, you get MySQL to feed Memcache. Now it appears I can start with something equally performant from the beginning.

I know I can't do transactions across relationships... when would this be a big deal?

I read http://teddziuba.com/2010/03/i-cant-wait-for-nosql-to-die.html but as I understand it, his argument is basically that real businesses which use real tools don't need to avoid SQL, so people who feel a need to ditch it are doing it wrong. But no "enterprise" has to deal with nearly as many concurrent users as Facebook or Google, so I don't really see his point. (Walmart has 1.8 million employees; Facebook has 300 million users).

I'm genuinely curious about this... I promise I'm not trolling.

+28  A: 

I am also a big fan of MongoDB. That having been said, it is absolutely not a wholesale replacement for RDBMS. Facebook has 300 million users but if some of your friends don't show up in the list one time, or one of the photo albums is missing on the occasional request, would you notice? Probably not. If your status update doesn't trickle down to all of your friends for a few minutes, does it matter? Hardly. If Wal-Mart's balance sheets are out of sync, would someone lose their head? Definitely.

NoSQL databases are great in "fuzzy" environments where relationships are not strict and data integrity can afford to be out of sync. RDBMS are still important when data sets are extremely complex and relational (hence the name), and they need to be kept pure.

The big push to NoSQL comes from the fact that for the last 30 years, we have been using RDBMS systems for both scenarios. We now have a more appropriate tool for many situations. Some would argue most, in fact. But no one would argue all.

Rex M
I like this answer, thank you for the detail. I'm curious, though, everything I know about scaling a relational database says you do replication, which is, just like Mongo's replication, "eventually consistent". How does a relational database keep things in sync across slaves better than a NoSQL database?
@jacobbaer across instances, there is a delay. What I am talking about is across data sets within a single instance. With RDBMS I can update any number of data sets as a single transaction. With MongoDB if two commits come in that affect multiple data sets, I need a third operation to come back through when they are finished and re-validate the data to ensure it is all correct. And that needs to happen without interruption, e.g. no *other* operations. So in a high-traffic system like we're talking about, relationships could almost always be slightly off.
Rex M
@rex I believe you do get transaction support within collections, but I guess if you have data that need to be consistent that spans collections (rather than being embedded) you're in trouble. Is there a technical reason why Mongo can't implement that? I find it hard to believe leaving off that feature is the only thing that makes it so much faster.
A: 

If your data does not take advantage of relational algebra and you do not need ACID guarantees, then you don't gain anything by using languages that cater exclusively for those uses.

Arafangion
Isn't the idea to design your schema in ways that don't require relational algebra? (Note that I learned this term 15 seconds ago, so my understanding of what you mean by that is pretty shallow). Although there are probably some problem domains where it can't be avoided...
@jacobbaer the idea is not to artificially develop relationships the application domain doesn't require, simply to make your data model fit into an RDBMS.
Rex M
@rex Right, but just going over some of my current projects, relationships represented as relationships in SQL worked just fine as embedded objects (especially with the way I wanted to query them). For example, one of my hobby projects is a Facebook app to compare schedules. I can HABTM users and class periods in MySQL, or I can store class periods as a list under users in Mongo. So the relationship isn't artificially created when I do it in SQL, but it also doesn't need to happen in Mongo. Are there situations where it's impossible to do something like this efficiently?
@jacobbaer I can't comment on the particulars of your application without being more familiar with it.
Rex M
Right, but I'm saying the particulars of my application demanded a relationship in SQL, but didn't need a relationship in Mongo, so I didn't have to give up ACID (which is maintained within collections).
+1  A: 

The big backlash against NoSQL is rooted in the mentality of many of the NoSQL advocates. Specifically, the attitude best summarized as "SQL is too hard, I shouldn't have to do it". I dislike NoSQL because it seems in many cases to be elevating ignorance.

I know I can't do transactions across relationships... when would this be a big deal?

More often than you might expect. There are a lot of things that can go wrong when you can't assume a consistent dataset.
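
To make that concrete, here is a minimal sketch of the kind of multi-row update a transaction protects. It uses Python's built-in `sqlite3` module as a stand-in for any RDBMS; the `accounts` table, the names and the amounts are all made up for illustration.

```python
import sqlite3

# Hypothetical schema and data, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        name TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO accounts VALUES ('alice', 100), ('bob', 20);
""")

def transfer(src, dst, amount):
    """Move money between two rows; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back along with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
    except sqlite3.IntegrityError:
        return False
    return True

def balances():
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer("alice", "bob", 500)` fails on the debit, and the credit that was already applied to bob's row is undone with it. Without a transaction spanning both rows, the application would have to notice and repair the half-finished state itself.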

Kalium
It's unfortunate the message is coming across that way. The *intention* is that SQL is indeed hard because it is a very specialized, high-precision language optimized for transactional, highly complex relational operations; yet for many applications, that level of complexity and precision isn't necessary. So they are quite right to say they shouldn't have to do it, *for that specific case*. Though again, it's very unfortunate they come across so negative/ignorant.
Rex M
Isn't it more "scaling existing implementations of SQL databases is too hard, I shouldn't have to do it"? "SQL is too hard, I shouldn't have to do it" seems to be more the argument for an ORM. Am I misunderstanding something?
@jacobbaer ORMs are not viable for many of the situations where RDBMS are the best match, precisely because they abstract away a lot of the power and precision SQL affords.
Rex M
Exactly what Rex M said. If you can use an ORM, you probably don't need the feature set/limitations of an RDBMS anyway.
Logan Capaldo
@rex Ah, okay. So to make sure I'm understanding this, writing an SQL query to do something efficiently is harder but preferable to writing a map/reduce function or similar to do it in Mongo, because taking the time to do it in SQL lets you keep better consistency, right?
@jacobbaer very nearly, only I'll go one step further - in SQL, if you use it even remotely properly, it's actually even difficult to get yourself in a situation where data *isn't* consistent.
Rex M
@logan would you say that's true even of something like Hibernate? Or just the more fluffy ORMs?
@Rex M - I'd argue that that's only true of the bad ones. SQLAlchemy, for instance, was very much designed to keep full access to all of SQL's capabilities; I've used it to dynamically build some _very_ complex queries (subqueries, UNION/INTERSECTs, etc), and it's arbitrarily extendable.
Charles Duffy
+2  A: 

I have used MongoDB, Redis (more than a key-value store: it supports lists, sets and sorted sets), Tokyo Tyrant, Memcached, MySQL and PostgreSQL.

The arguments between NoSQL DBs and SQL-based DBs are completely baseless. You need to choose the appropriate model based on your use case. If you need ACID compliance, go ahead with a SQL DB like PostgreSQL, Oracle, etc. If you need high performance but care less about your data, you may consider a NoSQL DB. They are fundamentally different technologies, and you can even use a combination of models. With NoSQL, you will be missing relationships, constraints and sometimes transactions. In fact, that is one of the reasons NoSQL DBs are faster.

I once lost two months of aggregate data with MongoDB, with no clue how it happened. I had a backup, though, so after restoring MongoDB from it I ended up losing only a few minutes of data. If you use a NoSQL DB, take occasional backups or schedule cron jobs for DB backups. The same applies to SQL DBs.

Compared to SQL RDBMSs, NoSQL DBs are younger and still under heavy development, but they are mature within their scope, i.e. they are meant for high performance and easy replication.

On my website (stacked.in), I use only Redis, and it works much faster than MySQL.

Gopalakrishnan Subramani
+4  A: 

How often do you think Facebook does arbitrary queries against its datastore(s)? Not everything is a web app, and conversely not every set of data needs to be analyzed deeply.

NoSQL, in my opinion, is largely a reactionary response to people using RDBMSs for tasks they were not well suited to, because those people didn't actively make a decision based on their needs and simply chose a default. To "jump ship from MySQL" (or RDBMSs in general) industry-wide would be to make the same mistake all over again, and the pendulum would end up swinging back the other way.

If MongoDB works for your use case, by all means go ahead. Just don't assume your use case is all use cases. There is no technology that fits all scenarios. The invention of supersonic jets didn't eliminate the use of freight trains.

Logan Capaldo
+1  A: 

Remember, NoSQL isn't exactly new. After all, they had to use something before SQL and relational databases, right? In fact, systems like MUMPS and CODASYL work the same way and are decades old. What relational databases give you is the ability to query data in arbitrary ways.

Say you have a database with customers, their purchases, and what items they purchased. A NoSQL DB might have customers containing purchases and purchases containing items. This makes it easy to find out what items a given customer purchased, but hard to find out what customers purchased a given item. A relational DB would have tables for customers, purchases, items, and a table linking items to purchases. In SQL, both queries are trivial to formulate, and the database engine does all the hard work for you.
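
As a rough illustration of that difference, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names and sample rows are all invented for the example, and the nested `customer_docs` list is only a plain-Python stand-in for a document store, not any particular NoSQL API.

```python
import sqlite3

# Hypothetical relational schema: customers, purchases, items, and a link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    CREATE TABLE purchase_items (
        purchase_id INTEGER REFERENCES purchases(id),
        item_id INTEGER REFERENCES items(id)
    );
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO items VALUES (1, 'kettle'), (2, 'teapot');
    INSERT INTO purchases VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO purchase_items VALUES (10, 1), (10, 2), (11, 1), (12, 2);
""")

# One set of joins serves both directions; only the filter and output change.
JOINS = """
    FROM customers c
    JOIN purchases p ON p.customer_id = c.id
    JOIN purchase_items link ON link.purchase_id = p.id
    JOIN items i ON i.id = link.item_id
"""

def items_for_customer(customer):
    sql = "SELECT DISTINCT i.name" + JOINS + "WHERE c.name = ? ORDER BY i.name"
    return [row[0] for row in conn.execute(sql, (customer,))]

def customers_for_item(item):
    sql = "SELECT DISTINCT c.name" + JOINS + "WHERE i.name = ? ORDER BY c.name"
    return [row[0] for row in conn.execute(sql, (item,))]

# The same data nested the "customers contain purchases contain items" way.
customer_docs = [
    {"name": "alice", "purchases": [{"items": ["kettle", "teapot"]},
                                    {"items": ["teapot"]}]},
    {"name": "bob", "purchases": [{"items": ["kettle"]}]},
]

def customers_for_item_nested(item):
    # The reverse lookup has to walk every customer and every purchase.
    return sorted(
        doc["name"]
        for doc in customer_docs
        if any(item in purchase["items"] for purchase in doc["purchases"])
    )
```

In the nested form, "what did alice buy" is a single document read, while "who bought a kettle" is a full scan unless you maintain a second structure for it. In the relational form both questions are the same joins with a different `WHERE` clause.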

Also, keep in mind that part of the NoSQL trend is to sacrifice consistency or reliability for speed, scalability, and cost. Relational DBs can scale, but it's not cheap. If you go to http://tpc.org you can find RDBMSes that run on hundreds of cores simultaneously to deliver millions of transactions per minute, but they cost millions of dollars.

Gabe
+5  A: 

I write this as a dispute to Rex's answer.

I dispute the idea that nosql is relationless and fuzzy.

I had been working with CODASYL many years ago with C and Cobol - entity relationships are very tight in CODASYL.

In contrast, relational database systems have a very liberal policy towards relationships. As long as you can identify a foreign key, you can form a relationship ad hoc.

It is frequently taken for granted that SQL is synonymous with RDBMS, but people have been writing SQL drivers for CODASYL, XML, inverted sets, etc.

RDBMS/SQL do not equal precision in data or relationships. In fact, RDBMS has been a constant cause of imprecision and misperception of relationships. I do not see how RDBMS offers better data and relationship integrity than Hadoop, for example. Put on a layer of JDO, and we can construct a network of good and clean relationships between entities in Hadoop.

However, I like working with SQL because it gives me the ability to script ad hoc relationships, even though I realise that ad hoc relationships are a constant cause of relationship adulteration and problems.

Having the opportunity to work with statistical analysis of business and industrial processes, SQL gave me the ability to explore relationships where no relationships had previously been perceived. The opportunity to work with statistical analysis gave me insights that would not normally come the way of SQL programmers.

For example, you would design and normalise your schema to reflect a set of processes. What you might not realise is that relationships change over time. The statistical characteristics would reveal that a schema may no longer be as "properly normalised" as it once had been. That the principal components of the processes have mutated over time. But non-statistical programmers do not understand that and continue to tout RDBMS as the perfect solution for data integrity and relationship precision.

However, in a relationship-linking database, you can link entities in relationships as they appear. When relationships mutate, the links naturally mutate with the data. Relationships and their mutations are documented within the database system without the expensive need to renormalise the schema. At that point, RDBMS is good only for temp DBs.

But then you might counter that RDBMS too allows you to flexibly mutate your relationships, since that is what SQL does best. True, very true, so long as you normalise to BCNF or even 4NF. Otherwise, you would begin to see your queries and data loaders performing replicated operations. But then your many years in the RDBMS business will certainly have made you realise that BCNF is very expensive and operationally inefficient, and that we are constantly guilty of 2.5 NFing our schemata.

To say that RDBMS and SQL promote data and relationship integrity is a gross misstatement. If you believe that, either you work in a tiny company or you never stayed in a position for more than two years, so you have not seen the amount of data and information mutation and the problems caused by RDBMS. The abuse of RDBMS is the cause of executives being restricted in their view by computer applications, and the cause of financial failures of companies that fail to see changes in market behaviour because their views were restricted by programmers whose own views were restricted by veneration of their beloved RDBMS schemata.

That is why SQL programmers do not understand why your company statistician refuses to use the application you crafted so meticulously and instead employs a college intern to write SQL to download data onto a personal server, and why your company executives learn to trust the accountants' and statisticians' spreadsheets rather than your elegant multi-tiered applications: your applications cannot mutate with the processes.

It might not be possible, but I still urge you to acquire some statistical understanding to perceive how processes mutate over time so that you can make the right technological decision.

The reason people are not moving to SQL-less systems is the lack of a good scripting environment like SQL for performing ad hoc relationship analysis, not because SQL-less technology is deficient in precision or integrity. Ad hoc relationship analysis is very important nowadays because of the rapid and agile application development attitudes and strategies we have adopted.

Blessed Geek
@h2g2java, I think you may be in danger of promoting at least two misconceptions here. Firstly, relational databases have nothing much to do with "relationships" per se. The R in RDBMS refers to the mathematical concept of a *relation* - which is not at all what I understand you to mean by a *relationship* (i.e. a semantic "association among things"). Secondly, SQL DBMSs are not relational anyway. The commonly cited drawbacks of SQL DBMSs are generally consequences of the SQL model and have nothing to do with the relational model - something which NoSQL advocates often seem confused about.
dportas
The American Constitution today has less to do with what "the founding fathers had originally intended it" than whom our elected Congressmen and venerated Presidents have chosen to fill the seats of the Supreme court.
Blessed Geek
@Blessed valuable insights, thanks for the contribution!
Rex M
+5  A: 

Let me hit the questions one at a time:

I know I can't do transactions across relationships... when would this be a big deal?

Picture cascading deletes. Or even just basic referential integrity. The concept of "foreign keys" can't really be enforced across "collections" (the Mongo term for tables). You can do atomic writes to only a single "document" (AKA record). So if you have a DB issue, you can orphan data in the DB.
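
For comparison, here is a minimal sketch of what a relational engine enforces for you, written against Python's built-in `sqlite3` module. The `users` and `posts` tables and their rows are hypothetical, and note that SQLite in particular only enforces foreign keys after the `PRAGMA` shown below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        title TEXT
    );
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO posts VALUES (1, 1, 'hello'), (2, 1, 'again');
""")

def try_insert_orphan():
    """Try to add a post whose user_id points at a user that does not exist."""
    try:
        conn.execute("INSERT INTO posts VALUES (3, 99, 'orphan')")
    except sqlite3.IntegrityError:
        conn.rollback()
        return False
    conn.commit()
    return True

# The engine refuses the orphan row outright.
orphan_accepted = try_insert_orphan()

# Deleting the parent row removes its children in the same statement.
conn.execute("DELETE FROM users WHERE id = 1")
conn.commit()
remaining_posts = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

With documents spread across collections, both the orphan check and the cascading cleanup become application code, and a crash between the two writes is exactly how orphaned data appears.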

I get as much performance as I have CPUs and RAM for free?

Not free, but definitely with a different set of trade-offs. For example, Mongo is great at running single-record, key/value look-ups. However, Mongo is poor at running relational queries. You'll need to use map-reduce for many of these. Mongo is a "RAM-whore". Mongo basically demands 64-bit for any significant dataset. Mongo will also suck up drive space: load up a 140GB DB and you can end up using 200+ GB as the swap file grows during use.

And you're still going to want a fast drive.

In fact, I think it's safe to say that MongoDB is really a DB system that caters to leading-edge hardware (64-bit, lots of RAM, SSDs). I mean, the whole DB is centered around looking up index data in RAM (hello 64-bit) and then doing focused random lookups on the drive (hello SSD).

why ... doesn't the entire industry jump ship from MySQL?

  1. It's not ACID-compliant. Probably quite bad for the banking system (of course, most of them are still processing flat files, but that's a different issue). However, note that you can force "safe" writes with Mongo and guarantee that data gets to disk, but only one "document" at a time.
  2. It's still very young. Lots of big businesses are still running old versions of Crystal Reports on their SQL Server 2000 app written in VB6. Or they're building enterprise service buses to manage the crazy heterogeneous environments they've built up over the years.
  3. It's a very different paradigm. Maybe 30% of the questions I regularly see on Mongo mailing lists (and here) are fundamentally tied to "how do I do query X?" or "how do I structure this data?". Using MongoDB typically requires that you denormalize in advance. This is not only a little difficult, it's also something nobody is trained in. Most people only learn "normalization" in school; nobody teaches us how to denormalize for performance.
  4. It's not the right tool for everything. Honestly I think that MongoDB is a great tool for reading and writing transactional data: the simple "one-at-a-time" CRUD that makes up much of modern apps. However, MongoDB is not really great at reporting. In fact, I honestly envision that the next step is not "Mongo for everything", it's "Mongo for transactional" and "MySQL for reporting". When your data gets big enough that you throw out "real-time reporting", then using Map-Reduce to populate a reporting DB doesn't seem that bad.
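
To sketch what "using Map-Reduce to populate a reporting DB" from point 4 could look like, here is a toy version in plain Python. It is not MongoDB's actual map-reduce API: the `orders` list stands in for a collection of documents, `map_reduce` is a hypothetical helper written for this example, and SQLite (via the built-in `sqlite3` module) stands in for the MySQL reporting database.

```python
import sqlite3
from collections import defaultdict

# Stand-in for a collection of transactional documents (invented sample data).
orders = [
    {"customer": "alice", "total": 30.0, "day": "2010-07-01"},
    {"customer": "bob", "total": 12.5, "day": "2010-07-01"},
    {"customer": "alice", "total": 7.5, "day": "2010-07-02"},
]

def map_order(doc):
    # Map step: emit one (key, value) pair per document.
    yield doc["day"], doc["total"]

def reduce_totals(key, values):
    # Reduce step: collapse all values emitted for one key.
    return sum(values)

def map_reduce(docs, mapper, reducer):
    """Hypothetical in-memory map-reduce: group emitted values by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

daily = map_reduce(orders, map_order, reduce_totals)

# Load the pre-aggregated result into a relational reporting table.
report = sqlite3.connect(":memory:")
report.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total REAL)")
report.executemany("INSERT INTO daily_sales VALUES (?, ?)", sorted(daily.items()))
report.commit()

rows = report.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
```

The batch job runs on whatever schedule the reports can tolerate, and anything ad hoc is then ordinary SQL against `daily_sales` rather than another map-reduce pass over the live data.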

As I understood it, as you scale, you get MySQL to feed Memcache. Now it appears I can start with something equally performant from the beginning.

Honestly, I'm working towards this on a few of my projects. Again, I think that MongoDB actually does make a valid caching layer. In fact, it makes a file-backed caching layer. So if you're capable of pushing MySQL changes to Mongo, then you're getting Memcached without cache misses. It also makes it easy to "warm the cache" on a new server: just copy the files and start Mongo pointing at the correct folder. It really is that easy.

Gates VP
A relational DB is also a "RAM-WHORE". Relational DBs become slow when indexes are too big to fit in RAM. MongoDB makes it easier to shard the data over multiple inexpensive computers. You can store 1-n relations in one collection, so cascading deletes between collections are often not needed. Life becomes more difficult when you want to do n:m relations with MongoDB. What worries me most about MongoDB is durability. I agree that MongoDB's map-reduce is slow. "select ... group by" is also easier.
TTT
On MongoDB and durability: http://www.mikealrogers.com/2010/07/mongodb-performance-durability/
TTT