Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone who plays with them seems to love them. I'm guessing that they are dealing with about 640K of data in their sample apps.

Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?

I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.

+2  A: 

MongoDB powers SourceForge, The New York Times, and several other large sites...

Justin Niessner
+2  A: 

You should read the MongoDB use cases. People who are just playing with a technology are often only looking at how it works and haven't hit the point where they understand its limitations. For the right sorts of datasets and access patterns, 50GB is nothing for MongoDB running on the right hardware.

These non-relational systems revisit the trade-offs that RDBMSs made and shift them a bit. In some situations consistency is not as important as other properties, so these systems let you trade it off for something else. The trade-off is still relatively minor: milliseconds, or maybe seconds, in some situations.

It is worth reading about the CAP theorem too.
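As a concrete illustration of that trade-off, here is a minimal sketch with the MongoDB Java driver (the database and collection names are made up): write concerns let you choose how much acknowledgement you wait for on each write, trading durability checks for latency.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;
    import com.mongodb.WriteConcern;

    public class WriteConcernSketch {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("demo");                       // hypothetical database
            DBCollection events = db.getCollection("events");  // hypothetical collection

            // Fire-and-forget style: don't wait for the server to acknowledge
            // each write; only network errors are reported. Fast, less safe.
            events.setWriteConcern(WriteConcern.NORMAL);
            events.insert(new BasicDBObject("type", "click")
                    .append("ts", System.currentTimeMillis()));

            // Acknowledged writes: wait for the server to confirm each write,
            // which costs a round trip but surfaces errors immediately.
            events.setWriteConcern(WriteConcern.SAFE);
            events.insert(new BasicDBObject("type", "purchase")
                    .append("amount", 42));

            mongo.close();
        }
    }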

BrianLy
+1  A: 

Here are some benchmarks on db4o:

http://www.db4o.com/about/productinformation/benchmarks/

I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o certainly seems to hang with the best of them.

mgroves
+2  A: 

I was looking at moving the API I have for use with the Stack Overflow iPhone app I wrote a while back from the MySQL database where it currently sits to MongoDB. In raw form the SO CC dump is in the multi-gigabyte range, and the way I constructed the documents for MongoDB resulted in a 10GB+ database. It's arguable that I didn't construct the documents well, but I didn't want to spend a ton of time on it.

One of the very first things you will run into if you start down this path is the lack of 32-bit support. Of course everything is moving to 64-bit now, but it's something to keep in mind. I don't think any of the major document databases support paging in 32-bit mode, and that is understandable from a code-complexity standpoint.

To test what I wanted to do, I used a 64-bit EC2 instance. The second thing I ran into is that even though the machine had 7GB of memory, once physical memory was exhausted things went from fast to not so fast. I may have had something set up incorrectly at that point, because the lack of 32-bit support had already killed what I wanted to use it for, but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box, but the scripts I used to load the two databases work differently, so I can't make a good comparison. Loading only a subset of the data into MongoDB was much faster, as long as it resulted in a database smaller than 7GB.

My takeaway was that large databases will work just fine, but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain high performance. I see a lot of people using MongoDB for logging, and I can imagine that many of those databases are massive, but at the same time they may not be doing a lot of random access, so that may mask what performance would look like for more traditional applications.
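To make "think about how the data is structured" concrete, here is a rough sketch with the Java driver of the kind of denormalized question document you might build from the dump, embedding answers and comments rather than joining separate tables. The IDs, field names, and values are invented for illustration, not the schema carson actually used.

    import java.util.Arrays;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class QuestionDocumentSketch {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("so_dump");                          // hypothetical names
            DBCollection questions = db.getCollection("questions");

            // One document per question with its answers and comments embedded,
            // instead of the separate rows you would join together in MySQL.
            BasicDBObject question = new BasicDBObject("_id", 12345)
                    .append("title", "Some question title")
                    .append("tags", Arrays.asList("mongodb", "db4o"))
                    .append("score", 7)
                    .append("answers", Arrays.asList(
                            new BasicDBObject("user", "someUser").append("score", 3)
                                    .append("body", "First answer text"),
                            new BasicDBObject("user", "otherUser").append("score", 1)
                                    .append("body", "Second answer text")))
                    .append("comments", Arrays.asList(
                            new BasicDBObject("user", "thirdUser")
                                    .append("body", "A comment on the question")));

            questions.insert(question);
            mongo.close();
        }
    }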

A recent resource that might be helpful is the visual guide to NoSQL systems. There are a decent number of choices outside of MongoDB. I have used Redis as well, although not with as large a database.

carson
Sorry you had such a miserable experience. If you're still interested, you could post what you're doing on http://groups.google.com/group/mongodb-user/ and maybe we can help? Importing should be very fast and the queries sound like you may have just needed an index somewhere or something.
kristina
Not miserable at all. I'll add that my intent was to make the resulting MongoDB database "correct". I wasn't trying to just make the load match the MySQL database I have, but instead to construct a full document representing each question with its answers, votes, and comments. Those are all denormalized in the dump, and I think part of the issue was pulling them together. Regardless, the 32-bit limitation was my only true problem. I'm sure I could have spent more time making it work well if I could justify using it.
carson
+4  A: 

Someone just went into production with 12 terabytes of data in MongoDB. The largest I knew of before that was 1TB. Lots of people are keeping really large amounts of data in Mongo.

It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
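For example, a small sketch with the Java driver (the collection and field names are hypothetical): create the index for the field you query on, then check with explain() that the query actually uses it.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.Mongo;

    public class ExplainSketch {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("demo");
            DBCollection users = db.getCollection("users");  // hypothetical collection

            // Index the field you query on, just as you would in an RDBMS.
            users.ensureIndex(new BasicDBObject("email", 1));

            // explain() shows the query plan: check that the index is used and
            // that the number of scanned objects is close to the number returned.
            DBCursor cursor = users.find(new BasicDBObject("email", "someone@example.com"));
            System.out.println(cursor.explain());

            mongo.close();
        }
    }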

kristina
+4  A: 

When I started db4o back in 2000, I didn't have huge databases in mind. The key goal was to store any complex object very simply, with one line of code, and to do that well and fast with low resource consumption, so it could run embedded and on mobile devices.

Over time we have had many users who used db4o for web apps and with quite large amounts of data, coming close to today's maximum database file size of 256GB (with a configured block size of 127 bytes). So to answer your question: yes, db4o will work with 50GB, but you shouldn't plan to use it for terabytes of data, unless you can nicely split your data over multiple db4o databases; the setup cost for a single database is negligible, you can just call #openFile().
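A rough sketch of what splitting over multiple db4o database files can look like in Java (the class, file names, and values are made up; the block-size call is included only to illustrate the 256GB figure above):

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.config.EmbeddedConfiguration;

    public class Db4oSplitSketch {
        // A hypothetical persistent class; db4o stores plain objects as they are.
        static class Order {
            String customer;
            int year;
            double total;
            Order(String customer, int year, double total) {
                this.customer = customer;
                this.year = year;
                this.total = total;
            }
        }

        public static void main(String[] args) {
            // One database file per year, as an example of splitting a large
            // dataset over several db4o databases; opening a file is the whole setup.
            for (int year = 2008; year <= 2010; year++) {
                EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
                // Larger block sizes raise the maximum file size (the 256GB figure above).
                config.file().blockSize(127);

                ObjectContainer db = Db4oEmbedded.openFile(config, "orders-" + year + ".db4o");
                try {
                    db.store(new Order("acme", year, 99.95));  // one line to persist an object
                } finally {
                    db.close();
                }
            }
        }
    }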

db4o was acquired by Versant in 2008 because its capabilities (embedded, low resource consumption, lightweight) make it a great complementary product to Versant's high-end object database, VOD. VOD scales to huge amounts of data, and it does so much better than relational databases. I think it will merely chuckle at 50GB.

Carl Rosenberger
A: 

Perhaps worth a mention.

The European Space Agency's Planck mission is running on the Versant Object Database. http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951

It is a satellite with 74 onboard sensors, launched last year, which is mapping the infrared spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because it's producing some of the coolest images of the universe ever seen.

Anyway, it has generated 25TB of information stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50TB.

Probably also worth noting: object databases tend to be a lot smaller when holding the same information. That is because they are truly normalized: no data duplication for joins, no empty wasted column space, and a few indexes rather than hundreds of them. You can find public information about testing ESA did to compare storage in a multi-column relational database format versus a proper object model stored in the Versant object database. They found they could save 75% of the disk space by using Versant.

Here is the implementation: http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf

Here they talk about the 3TB vs. 12TB found in that testing: http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/

Also, there are benchmarks which show Versant to be orders of magnitude faster on the analysis side of the mission.

Cheers, -Robert

Robert