I'm trying to decide what database system to use for storing information that is relatively static but needs to be computed in a number of different (runtime specified) ways. The basic contours of the data are votes in the US Congress:

A bill:

  • has many roll calls
  • has a name, and other short metadata
  • has text, and other potentially long metadata
  • has a status (passed, failed, in progress)

A roll call:

  • has a date
  • has many votes
  • has a status (passed, failed)

A vote:

  • belongs to a member of Congress
  • has a kind (aye, nay, present, not voting)

A member of Congress:

  • has a name (and other short metadata)
  • has many periods

A period:

  • has a start and end date
  • has a political party (Democrat, Republican, other)
  • has a position (member of Congress, committee chair, Speaker, etc.)

I would like to be able to easily build queries like:

  • For X, Y, and Z roll call votes, tell me the "Democratic" position and the "Republican" position. Then, rank the congressmen in the Congress in which those votes were held by their fidelity to those positions.
  • For X bill which failed, tell me the closest roll calls. Then, tell me which members of the majority party defected to produce those failures.
  • For X bill which passed, but which was opposed by the majority party, tell me which members of the majority defected to produce the passage.

I will have a finite number of query types like these, but the bills, roll call votes, political parties, etc. involved will be dynamically generated.

What is the best storage mechanism for the underlying data that will allow me to issue these queries dynamically and as performantly as possible?
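
For concreteness, here's roughly how I picture the defection part of the second and third queries in SQL, assuming (purely as a sketch) tables like roll_calls, votes, members, and periods with the obvious foreign keys; the syntax is PostgreSQL-flavored:

    -- Members of party :party who voted against their own party's
    -- majority position on roll call :roll_call_id.
    -- Table and column names are placeholders, not a finished schema.
    WITH party_votes AS (
        SELECT v.member_id, v.kind
        FROM votes v
        JOIN roll_calls rc ON rc.id = v.roll_call_id
        JOIN periods p     ON p.member_id = v.member_id
                          AND rc.held_on BETWEEN p.start_date AND p.end_date
        WHERE v.roll_call_id = :roll_call_id
          AND p.party = :party
    ),
    party_position AS (
        SELECT kind
        FROM party_votes
        GROUP BY kind
        ORDER BY COUNT(*) DESC
        LIMIT 1
    )
    SELECT m.name, pv.kind
    FROM party_votes pv
    JOIN members m ON m.id = pv.member_id
    WHERE pv.kind <> (SELECT kind FROM party_position);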

+3  A: 

This looks like pretty standard relational data to me. Any RDBMS (MySQL, SQL Server, PostgreSQL, etc.) will do.

Or are you asking advice on how to make tables to store this data?
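
If it's the latter, something roughly along these lines would be a reasonable starting point (table names, column names, and types are only a sketch, not a finished design):

    CREATE TABLE bills (
        id        INTEGER PRIMARY KEY,
        name      VARCHAR(255),
        status    VARCHAR(20),   -- passed, failed, in progress
        body_text TEXT           -- bill text and other long metadata
    );

    CREATE TABLE roll_calls (
        id      INTEGER PRIMARY KEY,
        bill_id INTEGER REFERENCES bills(id),
        held_on DATE,
        status  VARCHAR(20)      -- passed, failed
    );

    CREATE TABLE members (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(255)
    );

    CREATE TABLE periods (
        id         INTEGER PRIMARY KEY,
        member_id  INTEGER REFERENCES members(id),
        start_date DATE,
        end_date   DATE,
        party      VARCHAR(50),  -- Democrat, Republican, other
        position   VARCHAR(50)   -- member, committee chair, Speaker, etc.
    );

    CREATE TABLE votes (
        id           INTEGER PRIMARY KEY,
        roll_call_id INTEGER REFERENCES roll_calls(id),
        member_id    INTEGER REFERENCES members(id),
        kind         VARCHAR(20) -- aye, nay, present, not voting
    );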

Seth
+2  A: 

You could use just about any database, until I read:

...rank congressmen...

MySQL doesn't have any ranking functionality. I'm not clear on Postgres' ranking support, but Oracle and SQL Server have supported ranking for a while now (Oracle 9i+, SQL Server 2005+). And they both provide free versions.
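
For reference, the ranking part would look something like this with window functions (a sketch; member_fidelity is a hypothetical table or view that already holds a computed fidelity score per member):

    -- RANK() OVER (...) as supported in Oracle 9i+ and SQL Server 2005+.
    SELECT member_id,
           fidelity_score,
           RANK() OVER (ORDER BY fidelity_score DESC) AS fidelity_rank
    FROM member_fidelity;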

OMG Ponies
PostgreSQL 8.4 introduced support for ranking syntax.
Bill Karwin
@Bill: That's a bit more recent than I'd have expected.
OMG Ponies
@OMG Ponies: IIRC, they had tried to add it for 8.3, but didn't quite make it in time. Still fairly recent :-/
Adam Bernier
+2  A: 

Storage mechanism? Any mainstream database should be capable of dealing with the kind of scenario you are describing. Looks like pretty standard stuff to me.

Otávio Décio
+1  A: 

As others have stated - any relational database can support a simple model to solve this problem. However, a few other considerations:

  • This is an analytical rather than a transactional app, and the commercial databases are currently stronger at analytics: more mature optimizers, greater SQL functionality, better support for parallelism, materialized queries, automatic query rewrite against summary tables (see the sketch after this list), etc.
  • If you stick with the US Congress and don't decide to also support state legislatures or add a hundred years of historical data (all useful requirements), then pretty much any popular relational database can handle the performance issues. But if you do get into the state level, I'd consider the commercial databases first.
  • Of the open source databases, I'd consider the analytical functionality of PostgreSQL to be the most mature.
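
For example (a sketch only, with placeholder names matching the tables sketched above), a per-roll-call, per-party summary table lets the ranking and defection queries work from small aggregates instead of re-scanning the raw votes. In the commercial databases this could be a materialized view with automatic query rewrite; here it is shown as a plain table refreshed by a batch job:

    CREATE TABLE roll_call_party_summary (
        roll_call_id INTEGER,
        party        VARCHAR(50),
        kind         VARCHAR(20),
        vote_count   INTEGER,
        PRIMARY KEY (roll_call_id, party, kind)
    );

    INSERT INTO roll_call_party_summary
    SELECT v.roll_call_id, p.party, v.kind, COUNT(*)
    FROM votes v
    JOIN roll_calls rc ON rc.id = v.roll_call_id
    JOIN periods p     ON p.member_id = v.member_id
                      AND rc.held_on BETWEEN p.start_date AND p.end_date
    GROUP BY v.roll_call_id, p.party, v.kind;
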
KenFar
A: 

Here's where I would usually chime in and say: use CouchDB or some other schema-free NoSQL database. But the way the problem is spec'd lays out nicely for a relational store. Plus, there's not a terribly large amount of data that would require distributed processing a la MapReduce.

That being said, if the question were framed a bit differently, without the initial relational bias (you're already in data design mode :) ), then a system like CouchDB could work. Depending on the analyses to be performed, a more document-centered approach might be helpful, as all the information needed for an analysis is present on each document (denormalized), avoiding expensive joins.

Each bill might be one of these docs (JSON in CouchDB's case), with the roll calls, votes, and members of Congress (including their periods) nested as sub-attributes of the one 'bill' document. You could then map/reduce over all of the 'bill' documents to perform your queries. A different document-oriented design might make sense depending on query requirements.
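
A 'bill' document might look roughly like this (a sketch; every field name and value is purely illustrative):

    {
      "_id": "bill-00001",
      "type": "bill",
      "name": "Example Bill",
      "status": "passed",
      "text": "Full bill text ...",
      "roll_calls": [
        {
          "date": "2010-01-15",
          "status": "passed",
          "votes": [
            {
              "kind": "aye",
              "member": {
                "name": "Example Member",
                "periods": [
                  {
                    "start": "2009-01-03",
                    "end": "2011-01-03",
                    "party": "Democrat",
                    "position": "member"
                  }
                ]
              }
            }
          ]
        }
      ]
    }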

As the data set grows, you're not worried about size/performance, because you can always add more servers to run the map/reduce queries and distribute the load. Further, being schemaless means documents can change as your app changes, without expensive RDBMS table locking. But again, this data set doesn't change terribly often, and it is not massive.

boorad