I'm developing an application that will store a sizeable number of records. These records will be something like (URL, date, title, source, {optional data...})

As this is a client-side app, I don't want to use a database server; I just want the info stored in files.

I want the files to be readable from various languages (at least Python and C++), so something language-specific like Python's pickle is out.

I am seeing two possibilities: sqlite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB, however I don't really know how I should use it to store my records, as it only stores key/value pairs.

Is my reasoning correct? If so, how should I use BDB to store my records? Can you link me to relevant info? Or am I missing a better solution?

A: 

What about MongoDB? I haven't tried it yet, but it seems interesting.

Bastien Léonard
Looks interesting... Doesn't seem to be really mature yet, though.
static_rtti
+2  A: 

BerkeleyDB is good; also look at the *DBM incarnations (e.g. GDBM). The big question, though, is: what do you need to search by? Do you need to search by that URL, by a range of URLs, or by the dates you list?

It is also quite possible to keep groups of records as simple files in the local filesystem, grouped by dates or search terms, &c.

Answering the "search" question is the biggest start.

As for the key/value thingy, what you need to ensure is that the KEY itself is well defined for your lookups. If, for example, you need to look up by date sometimes and by title at other times, you will need to maintain a "record" row, and then possibly 2 or more "index" rows that refer back to the original record. You can model nearly anything in a key/value store.
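A minimal sketch of the "record" row plus "index" rows idea, assuming a plain Python dict as a stand-in for the key/value store and JSON for serialization. The `rec:` and `idx:` key prefixes and the helper names `put_record` and `find_by` are made up for illustration; they are not part of BDB or any DBM API.

```python
import json

# A plain dict stands in for the key/value store (BDB, GDBM, ...).
store = {}

def put_record(store, rec_id, record):
    """Write one record row, plus one index row per searchable field."""
    store[f"rec:{rec_id}"] = json.dumps(record)
    for field in ("date", "title"):
        index_key = f"idx:{field}:{record[field]}"
        ids = json.loads(store.get(index_key, "[]"))
        if rec_id not in ids:
            ids.append(rec_id)
        # The index row only holds ids that point back at record rows.
        store[index_key] = json.dumps(ids)

def find_by(store, field, value):
    """Follow an index row back to the record rows it references."""
    ids = json.loads(store.get(f"idx:{field}:{value}", "[]"))
    return [json.loads(store[f"rec:{i}"]) for i in ids]

put_record(store, "1", {"url": "http://example.com/a", "date": "2009-11-02",
                        "title": "First", "source": "feed"})
put_record(store, "2", {"url": "http://example.com/b", "date": "2009-11-02",
                        "title": "Second", "source": "feed"})

same_day = find_by(store, "date", "2009-11-02")
by_title = find_by(store, "title", "Second")
```

Note that updating or deleting a record means updating its index rows too; the store will not do that for you.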

Xepoch
"You can model nearly anything in a key/value store." Could you recommend something to read on this? I can see that this model is very general, but reading a few examples would be useful.
static_rtti
I can see what I can find, but the traditional basics of an underlying DB store are effectively a key/value store in some form or another. A heap table is just rows written into a key/value store, with the row as the value and a generated ROWID of sorts as the key. A non-compound index on such a table lists the values of the indexed column as the key and the ROWID as the value. Sure, it gets more complicated than that, but *there is nothing that cannot be solved with another level of indirection* applies here. I'll comment back if I can find a few articles.
Xepoch
+5  A: 

I am seeing two possibilities: sqlite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB, however I don't really know how I should use it to store my records, as it only stores key/value pairs.

What you are describing is exactly what relational is about, even if you only need one table. SQLite will probably make this very easy to do.

EDIT: The relational model doesn't have anything to do with relationships between tables. A relation is a subset of the Cartesian product of other sets. For instance, the Cartesian product of the real numbers, the real numbers, and the real numbers (yes, all three the same) produces 3D coordinate space, and you could define a relation upon that space with a formula, say x*y = z. Each possible triple of coordinates (x0,y0,z0) is either in the relation, if it satisfies the given formula, or it is not.

A relational database uses this concept with a few additional requirements. First, and most important, the size of the relation must be finite. The product relation given above doesn't satisfy that requirement, because there are infinitely many 3-tuples that satisfy the formula. There are a number of other considerations that have more to do with what is practical or useful on real computers solving real problems.
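As a small illustration of the definition above, here is a sketch that builds a finite relation in Python. The domain `range(4)` is an arbitrary stand-in for the real numbers, chosen only so that the Cartesian product (and therefore the relation) is finite.

```python
from itertools import product

# A small finite set stands in for the real numbers.
domain = range(4)

# The Cartesian product domain x domain x domain: every (x, y, z) triple.
space = product(domain, repeat=3)

# The relation is the subset of triples that satisfy x*y = z.
relation = {(x, y, z) for x, y, z in space if x * y == z}

print((1, 2, 2) in relation)  # this triple satisfies the formula
print((2, 2, 3) in relation)  # this one does not
```

Each member of `relation` corresponds to what a relational database would call a row, and the set as a whole corresponds to a table.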

A better way of thinking about the problem is to think about where each type of persistence mechanism specifically works better than the other. You already recognize that a relational solution makes sense when you have many separate datasets (tables) that must support relationships between them (foreign key constraints), which is almost impossible to enforce with a key-value store. Another real advantage to relational is the way it makes rich, ad-hoc queries possible with the use of proper indexes. This is a consequence of the database layer actually understanding the data that it is representing.

A key-value store has its own set of advantages. One of the more important is the way that key-value stores scale out. It is no coincidence that memcached, CouchDB, and Hadoop all use key-value storage, because it is easy to distribute key-value lookup across multiple servers. Another area where key-value storage works well is when the key or value is opaque, such as when the stored item is encrypted and only readable by its owner.


To drive home the point that a relational database works well even when you don't need more than one table, consider the following (not original):

SELECT t1.actor1 
FROM workswith AS t1, 
     workswith AS t2, 
     workswith AS t3, 
     workswith AS t4, 
     workswith AS t5,
     workswith AS t6
WHERE t1.actor2 = t2.actor1 AND
      t2.actor2 = t3.actor1 AND
      t3.actor2 = t4.actor1 AND
      t4.actor2 = t5.actor1 AND
      t5.actor2 = t6.actor1 AND
      t6.actor2 = 'Kevin Bacon';

This obviously uses a single table, workswith, to compute every actor with a Bacon number of 6.

TokenMacGuy
Could you elaborate? For me relational only really makes sense if you have several tables with relationships between them...
static_rtti
+2  A: 

Personally I would use sqlite anyway. It has always just worked for me (and for others I work with). When your app grows and you suddenly do want to do something a little more sophisticated, you won't have to rewrite.

On the other hand, I've seen various comments on the Python dev list about Berkeley DB that suggest it's less than wonderful. You only get dict-style access (what if you want to select certain date ranges or titles instead of URLs?), and it's not even in Python 3's standard set of libraries.

andrew cooke
"it's not even in Python 3's standard set of libraries.". Didn't know that, that's a very good point, thanks!
static_rtti
Please check. I had a look and I can see (g|n)dbm support, but I think that's different, right? Maybe the discussion I remember in the dev list was related to dropping it.
andrew cooke
+1  A: 

If you're only going to use a single field to look up records, a simple key-value store would be a good choice. Store that single field (or any other unique ID) as your key, serialize each record as a string (using JSON or similar), and store that string as the value. Berkeley DB is certainly a reasonable choice for a key-value store, but there are many alternatives to choose from: http://en.wikipedia.org/wiki/Dbm
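A minimal sketch of the single-key approach, assuming the URL is the lookup key. It uses Python's standard `dbm` module rather than Berkeley DB itself, so the on-disk format depends on whichever DBM backend your Python build provides; the file name and sample record are made up for illustration.

```python
import dbm
import json
import os
import tempfile

# Hypothetical location for the store; a real app would pick a fixed path.
path = os.path.join(tempfile.mkdtemp(), "records")

record = {"url": "http://example.com/a", "date": "2009-11-02",
          "title": "First", "source": "feed"}

# Store: the URL is the key, the JSON-serialized record is the value.
with dbm.open(path, "c") as db:
    db[record["url"].encode("utf-8")] = json.dumps(record).encode("utf-8")

# Look up: fetch the bytes for the key and deserialize them.
with dbm.open(path, "r") as db:
    loaded = json.loads(db[b"http://example.com/a"].decode("utf-8"))
```

Because the value is plain JSON, a C++ program using the same DBM library could read it with any JSON parser.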

If you want to look up records by any of several fields, SQLite might be easiest for development purposes. You'll be writing queries in SQL but you won't have to maintain a database server. All the multi-key machinery is already written for you.
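A sketch of what that could look like with Python's built-in `sqlite3` module. The table and column names are assumptions based on the record layout in the question, and dates are assumed to be stored as ISO 8601 text so that string comparison matches date order.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        url    TEXT PRIMARY KEY,
        date   TEXT NOT NULL,   -- ISO 8601, so text order equals date order
        title  TEXT,
        source TEXT,
        extra  TEXT             -- optional data, e.g. a JSON string
    )
""")
# One index per additional lookup field.
conn.execute("CREATE INDEX records_by_date ON records (date)")
conn.execute("CREATE INDEX records_by_title ON records (title)")

conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
    [
        ("http://example.com/a", "2009-10-30", "First", "feed", None),
        ("http://example.com/b", "2009-11-02", "Second", "feed", None),
        ("http://example.com/c", "2009-11-15", "Third", "web", '{"tag": "x"}'),
    ],
)

# Range lookup by date.
november = conn.execute(
    "SELECT url FROM records WHERE date BETWEEN ? AND ? ORDER BY date",
    ("2009-11-01", "2009-11-30"),
).fetchall()

# Exact lookup by title.
by_title = conn.execute(
    "SELECT url FROM records WHERE title = ?", ("First",)
).fetchall()
```

The same database file can be opened from C++ through the SQLite C API.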

If you really want to avoid SQL or squeeze every bit of performance out of your data store, and you want multi-key access, consider a layer of extra logic on top of a key-value store. It is possible to build column-like behavior on top of key-value stores by serializing your records and inserting the "column" values of each record as additional keys whose values contain the "primary" key of your record. (You're effectively using the key-value store as both a dictionary of records and a dictionary of indexes for finding those records.) Google's App Engine does something like this. You can do this yourself or use one of various document-oriented databases that will do it for you. For some interesting reading, try googling "nosql". http://www.google.com/search?&q=nosql

Forest
P.S. The deal with Berkeley DB in the Python distribution is simply that the bdb library internals were changing more frequently than the Python devs wanted to keep up with. It's not that Berkeley DB was bad, just inconvenient to integrate directly into Python releases. You can still get the bdb Python bindings as a separate module.
Forest
A: 

OK, so you say you're just storing the data? You really only need a DB for retrieval, lookup, summarising, etc. So, for storing, just use simple text files and append lines. Compress the data if you need to and use delimiters between fields; just about any language will be able to read such files. If you do want to retrieve, then focus on your retrieval needs: by date, by key, which keys, etc. If you want a simple client-side solution, then you need a simple client DB. SQLite is far easier than BDB, but also look at things like Sybase Advantage (very fast and free for local clients, but not open source), VistaDB, or Firebird, but all will require local config/setup/maintenance. If you go with local XML, a 'sizable' number of records will give you unnecessarily bloated file sizes.
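The append-only text file idea from the start of this answer could be sketched like this, assuming tab-delimited lines written with Python's `csv` module. The file name, field order, and helper names `append_record` and `read_records` are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical file location; a real app would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "records.tsv")

def append_record(path, url, date, title, source):
    """Append one record as a tab-delimited line."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerow([url, date, title, source])

def read_records(path):
    """Read every record back as a list of fields."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

append_record(path, "http://example.com/a", "2009-11-02", "First", "feed")
append_record(path, "http://example.com/b", "2009-11-03",
              "Second, with a comma", "web")

records = read_records(path)
```

Lookups here mean scanning the whole file, which is why the retrieval needs matter before settling on this format.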

andora