views:

742

answers:

5

Hi, I need to store a sparse matrix on disk. It is like a database table with millions of rows and thousands of columns, where many or most columns are null. It needs to be queryable, like a SQL SELECT with a WHERE on some of the columns.

My specific requirement is on Java. I first thought of using Berkeley DB for Java to simulate a table, but then it does not support querying based on values.

Then, I thought about using a regular SQL database. For example, creating a schema with only a Row ID, a Column ID, and the value. The virtual row will be all the actual rows with the same ID. But then, this looks like database abuse.

Any ideas?

A: 

It depends on your definition of "many or most columns are null", but that sounds like a very reasonable approach, assuming that you really need the random access.

If you can do everything by sequential processing (e.g. a scan in row order) then a flat file would be another reasonable option to consider.

joel.neely
A: 

the Intersystems Cache database uses structures internally for storing data, which are sparse multidimensional arrays. Maybe check that out. You can query it, and map it to SQL tables. I'm not sure if you can directly access the multidimensional arrays in Intersystems Cache from java though.

Raymond Roestenburg
+2  A: 

The first thing that came to my mind when reading the question heading was a database row per (x,y) as you suggested in your next to last paragraph.

The other thing to note is that databases often compress the rows, particularly for NULLs, so the straightforward representation may not waste as much space as you think.

Doug Currie
A: 

If you only need to save the data as you say "on disk", read it back & make certain queries, you might want to consider simply serializing the class of your own design and choice. Eliminates all need for database infrastructure, but obviously also excludes many of the things that a database would actually help you with,

fredarin
+1  A: 

Recently, I've become interested in key-value type storage, and came across this blog entry describing how FriendFeed uses MySQL to implement a similar requirement:

http://bret.appspot.com/entry/how-friendfeed-uses-mysql

mparaz