views:

196

answers:

2

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:

  1. Store the data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDBMS implementations this is typically very inefficient, since the NULL values usually take up some space. However, I read an article (can't find its link unfortunately) that claimed PG doesn't use storage for NULL values, making it better suited for storing sparse data.

  2. Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.

I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.

Are there any other solutions? Which one should I use?

+1  A: 

A NULL value takes up no space in the row data. It will take up one bit in the NULL bitmap in the tuple header, but that bitmap is there regardless.

However, the system can't deal with millions of columns, period. The hard limit is 1600 columns per table, and you really don't want to go anywhere near that.

If you really need that many in a single table, you need to go the EAV route, which is basically what you're describing in (2).

If each entry has only relatively few keys, I suggest you look at the "hstore" contrib module as a third option; it lets you store this type of data very efficiently. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well, and it does support some pretty efficient index-based lookups. Definitely worth looking into.
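
A minimal sketch of what an hstore-backed table could look like (table and column names are made up; on 8.4/9.0 the module is installed by running the contrib/hstore.sql script against your database):

    CREATE TABLE items (
        id       serial PRIMARY KEY,
        features hstore            -- sparse key/value features live here
    );

    INSERT INTO items (features)
    VALUES ('color => red, weight => 10'::hstore);

    -- GiST index supports the containment (@>) and key-exists (?) operators
    CREATE INDEX items_features_idx ON items USING gist (features);

    -- rows that have a given feature value
    SELECT * FROM items WHERE features @> 'color => red';

    -- pull one feature out of a row
    SELECT features -> 'weight' FROM items WHERE id = 1;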

Magnus Hagander
+1  A: 

A few solutions spring to mind:

1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, and only join on the tables you need when querying

2) Use the EAV anti-pattern: create a 'feature' table with a foreign key field referencing your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table (see the sketch after this list)

3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables

4) Create a column in your primary table of type XML or VARCHAR, store some structured text format within it representing your data, create indexes over the data with functional indexes, and write functions to update the data (or use the XML functions if you are using that format)

5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated

6) live with lots of empty fields
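
A rough sketch of option 2 (EAV), with invented table and column names:

    -- primary table: one row per entity
    CREATE TABLE item (
        item_id serial PRIMARY KEY,
        name    varchar NOT NULL
    );

    -- feature table: one row per (entity, feature) pair that is actually set
    CREATE TABLE item_feature (
        item_id   integer NOT NULL REFERENCES item,
        fieldname varchar NOT NULL,
        value     varchar,
        PRIMARY KEY (item_id, fieldname)
    );

    -- set a feature on entity 1
    INSERT INTO item_feature (item_id, fieldname, value)
    VALUES (1, 'color', 'red');

    -- fetch all features for entity 1
    SELECT fieldname, value FROM item_feature WHERE item_id = 1;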

MkV
Also you could create a 'feature' TYPE AS (featurename VARCHAR, featurevalue VARCHAR) (or whatever the value needs to be) and add a FEATURES field of type feature[] to your primary table.
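
For instance (again with made-up names, reusing the hypothetical item table from the sketch above):

    -- composite type holding a single feature
    CREATE TYPE feature AS (
        featurename  varchar,
        featurevalue varchar
    );

    -- array-of-features column on the primary table
    ALTER TABLE item ADD COLUMN features feature[];

    UPDATE item
    SET features = ARRAY[ROW('color', 'red')::feature,
                         ROW('weight', '10')::feature]
    WHERE item_id = 1;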
MkV
Why do you call EAV an "anti-pattern"? Googling shows this is a common description of EAV (usually used disparagingly), but no one seems to explain why. There seem to be many valid cases where databases need to store sparse data, such as the medical field, making EAV a suitable "pattern".
Chris S
It throws away all the advantages of the database, such as row-level constraints and referential integrity, and makes it difficult to return a single row for a single entity.
MkV