I'm looking for a database that supports the following functionality:

1) Records in the database are like Python dictionaries or Perl hashes. For example, a "purchase" record might look like this:

<purchase 5436> = { product: "BMX Bike", price: 99.50, city: "Springfield" }

2) The records are stored in arrays of variable length. The database contains lots of these arrays. For example, the purchase table might look like this:

purchase array 1: [ <purchase 5436>, <purchase 54>, <purchase 112> ]
purchase array 2: [ <purchase 76>, <purchase 5984>, <purchase 1102>, <purchase 12> ]
...
purchase array 658: [ <purchase 10142>, <purchase 35>, <purchase 6458>, <purchase 23> ]

3) I want to be able to do two kinds of queries on this database:

3a) Count the number of records that match various criteria. For example, how many purchases were made with a value over 50? I know of lots of databases that support this.

3b) Count the number of times records appear in a certain order. For example, how many arrays are there where a purchase over 50 was made and then a purchase in "Springfield" was made? I don't know what kind of database you would use to do this.
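In Python terms, the two query types might look like the following. This is a minimal in-memory sketch, purely illustrative; the sample data and the "over 50 then Springfield" criteria are just placeholders:

```python
# Purchase arrays as lists of dicts, as described above.
purchase_arrays = [
    [{"product": "BMX Bike", "price": 99.50, "city": "Hometown"},
     {"product": "Helmet", "price": 25.00, "city": "Springfield"}],
    [{"product": "Lock", "price": 12.00, "city": "Springfield"}],
]

# 3a) count records matching a criterion: purchases over 50
over_50 = sum(1 for arr in purchase_arrays
                for p in arr if p["price"] > 50)

# 3b) count arrays in which a purchase over 50 is later followed
#     by a purchase in "Springfield"
def has_ordered_pair(arr):
    seen_over_50 = False
    for p in arr:
        if seen_over_50 and p["city"] == "Springfield":
            return True
        if p["price"] > 50:
            seen_over_50 = True
    return False

ordered = sum(1 for arr in purchase_arrays if has_ordered_pair(arr))
```

With the sample data above, `over_50` is 1 and `ordered` is 1 (the first array has the BMX Bike over 50 followed by a Springfield purchase).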

edit: Response to Steve Haigh: I should have mentioned that speed is important, and this database needs to support gigabytes of data. For example, there might be 1,000,000,000 purchase arrays, and I want to count how many of them have a purchase in "Springfield" followed by a purchase in "Hometown" (note that order is important). Maybe I'm wrong, but I think a relational DB would be too slow for this purpose.

A: 

I am not sure I completely understand what you are looking for, but have you looked at CouchDB? It's document-oriented and schema-free.

Surya
+2  A: 

Are you sure you can't do this with a relational DB using a link or junction table?

You would have a table of orders, a table of products, and a junction table order-products which has a row for every product per order.
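The junction-table idea can be sketched with SQLite from Python; the table and column names here are illustrative assumptions, not from the original post:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name TEXT, price REAL, city TEXT);
    -- junction table: one row for every product per order
    CREATE TABLE order_products (
        order_id INTEGER REFERENCES orders,
        product_id INTEGER REFERENCES products);
""")
con.execute("INSERT INTO orders VALUES (1)")
con.executemany("INSERT INTO products VALUES (?, ?, ?, ?)",
                [(1, "BMX Bike", 99.50, "Springfield"),
                 (2, "Helmet", 25.00, "Hometown")])
con.executemany("INSERT INTO order_products VALUES (?, ?)",
                [(1, 1), (1, 2)])

# query 3a: how many purchases were over 50?
count, = con.execute(
    "SELECT COUNT(*) FROM order_products op "
    "JOIN products p ON p.product_id = op.product_id "
    "WHERE p.price > 50").fetchone()
```

With the two sample products, `count` comes out as 1.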

I think this article probably expresses it better than I could.

Steve Haigh
+2  A: 

For example, there might be 1,000,000,000 purchase arrays, and I want to count how many of them have a purchase in "Springfield" followed by a purchase in "Hometown" (note that order is important). Maybe I'm wrong, but I think a relational DB would be too slow for this purpose.

What you describe are typical data warehouse queries, and AFAIK those are usually implemented using relational DBs, albeit ones that are optimized for reporting rather than for concurrent transaction processing. However, I don't think the difference in speed will be extreme if you use a "regular" RDBMS. Of course, if you have enough money, you could go for a special data warehouse DBMS.

The most important influences on speed are going to be 1) a technology optimized for querying large disk-based data sets - that's exactly what all "real" DBMSs offer, and 2) data organized in the right way.

3b) Count the number of times records appear in a certain order. For example, how many arrays are there were a purchase over 50 was made and then a purchase in "Springfield" was made? I don't know what kind of database you would use to do this.

You would use a relational DB with a schema designed to support that kind of query. You are going to have to give up your preconceived notion of how the data should be represented.
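One way such a schema might look: store each purchase with its array ID and an explicit position column, and answer the ordered query with a self-join. This is a sketch using SQLite; the `array_id`/`seq` column names are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE purchase (
        array_id INTEGER,   -- which purchase array the record belongs to
        seq      INTEGER,   -- position within that array
        price    REAL,
        city     TEXT)
""")
con.executemany("INSERT INTO purchase VALUES (?, ?, ?, ?)", [
    (1, 0, 99.50, "Springfield"),  # Springfield ...
    (1, 1, 25.00, "Hometown"),     # ... later followed by Hometown: match
    (2, 0, 12.00, "Springfield"),  # Springfield but no later Hometown
])

# count arrays with a purchase in Springfield followed (later) by one
# in Hometown: self-join on array_id with a position comparison
count, = con.execute("""
    SELECT COUNT(DISTINCT a.array_id)
    FROM purchase a JOIN purchase b
      ON a.array_id = b.array_id AND a.seq < b.seq
    WHERE a.city = 'Springfield' AND b.city = 'Hometown'
""").fetchone()
```

With the sample rows, `count` is 1; with an index on `(array_id, seq)` and on `city`, this is the kind of query a reporting-oriented RDBMS is built to handle.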

Michael Borgwardt
+1  A: 

You don't really need a relational database, as you just have key->value pairs grouped in collections; you would need joins between the two tables (one for the records, one for the collections) to iterate over the records in a collection, and in your case that is not worth the cost.

For your performance requirements, what you need is to make sure that the whole structure fits in memory and doesn't require access to disk. You might need several servers to do this, plus a master that dispatches the lookups to the other servers (assuming that the size of your structure is bigger than the reasonable amount of memory a modern server can handle, and that your speed requirements are so strict that you cannot afford disk paging).

For the kind of queries that you mention, your best option is to have a bit of data redundancy: on insertion, you keep those counts up to date. Data redundancy tends to freak people out at the mere mention of the name, but it is sometimes necessary. Just be extremely careful with your implementation and invest a good amount of unit testing here.
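The "maintain counts at write time" idea might be sketched like this; the counter names and the thresholds are illustrative assumptions:

```python
# Redundant counters, updated on every insert, so the corresponding
# count queries become O(1) dictionary lookups instead of full scans.
counts = {"over_50": 0, "springfield": 0}
purchase_arrays = []

def insert_purchase(array, purchase):
    array.append(purchase)
    # keep the redundant counters in sync at write time
    if purchase["price"] > 50:
        counts["over_50"] += 1
    if purchase["city"] == "Springfield":
        counts["springfield"] += 1

arr = []
purchase_arrays.append(arr)
insert_purchase(arr, {"product": "BMX Bike", "price": 99.50,
                      "city": "Springfield"})
```

Deletes and updates would need matching decrement logic, which is exactly where the careful unit testing comes in.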

There might be, though, some kinds of queries that you won't ever be able to answer in real time in a matter of milliseconds, and the one about finding purchases with one condition followed by purchases with another condition looks like that. Either you find a way of maintaining a live tally of these numbers while inserting/deleting/modifying, or you will have to actually iterate over your millions of arrays; there is no way to avoid that. You will need to consider how recent your data needs to be, and maybe pre-calculate these statistics every few hours so that you can then access them in O(1) with lookup keys.

In a nutshell, your problem is way beyond the technology you decide to use to solve it.

palako
Not related? Then what exactly do you mean by "grouped in collections"?
bubaker
Well, you are right, that is a relationship. I meant that it is not worth spending time on joins for such a simple parent/child structure, but you are correct; I'll edit my answer. Thanks for pointing it out.
palako
A: 

What you are describing is quite similar to MUMPS, although I have some doubts about whether you can define queries where the order of "records" in the arrays matters.

Have a look at the link; as you will see, there are also current commercial versions of this.

p.marino