Sites like IBM's Many Eyes, Swivel, etc. store a wide variety of data and allow their users to visualize it. How do they design their tables? For example, if you were to save the data from the data.gov site into a database and allow your users to perform operations on it, how would you go about designing the tables? The structure needs to be generic enough to hold any type of data. data.gov, for example, has tons of data, some of it more complex than the rest.
The simplest way to answer your question is to say: use a string-string dictionary. It's a popular structure in the NoSQL community, and the flexibility of Python and Lua is built on it as well. You can specialize it for your domain by adding a dimension such as time; Hypertable, for example, does that.
Any data model can be serialized to a string-string dictionary. I don't know the specifics, but MySQL has a BDB (Berkeley DB) back end, and BDB's core data structures are string-string.
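To make the serialization claim concrete, here is a minimal sketch, assuming a path-style key scheme and JSON-encoded leaf values. The `flatten` helper, the `/` separator, and the sample record are all hypothetical choices for illustration, not how any particular store does it.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts/lists into a flat dict of string keys to string values."""
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        items = enumerate(record)
    else:
        # Leaf value: keep its JSON encoding so the type survives as text.
        return {prefix: json.dumps(record)}
    flat = {}
    for key, value in items:
        # Build a path-like key such as "rows/0/year" (separator is an assumption).
        path = f"{prefix}/{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat

# Made-up sample record standing in for one imported dataset.
dataset = {
    "title": "Example dataset",
    "rows": [{"year": 2008, "value": 1.5}, {"year": 2009, "value": 2.0}],
}
store = flatten(dataset)
```

Every key and value in `store` is a string, so it could be written to any string-string back end. The trade-off is that queries over the values now have to parse them back out.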
P.S. I'm half a relational zealot as well, so if the data is important I'd model it relationally :P
I can't be much help but this article, How Friendfeed uses MySQL to store schemaless data, might be of some use.
You could also check out document-oriented databases like CouchDB or MongoDB.
The key question is whether it's the simple retrieval of the data that's important or the aggregating and searching through it.
I.e., what are you using the data FOR?
If it's just data (i.e., just some random text or binary), I wouldn't bother with a database at all. Just slap it in a series of files, strip it of encoding and use grep / sed / awk / LISP to move through it without any labels. Data like this is only really useful for search / retrieve operations rather than deep trending.
If it's a single row or element of data (like a Stack Overflow question or comment), I'd consider either the NoSQL patterns (essentially, just lookups) or an OODB.
If it's the relations that are important, I'd model it like a graph, with edges and nodes. Nodes contain data, edges contain relationships. I'd be tempted at that point to implement it manually using disk-based pointers.
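As a rough illustration of the node/edge split, here is a minimal in-memory sketch. It uses plain dictionaries rather than disk-based pointers, and the `Graph` class, node ids, and relationship names are all made up for the example.

```python
from collections import defaultdict

class Graph:
    """Tiny in-memory graph: nodes hold data, edges hold named relationships."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of data
        self.edges = defaultdict(list)  # source id -> list of (relationship, target id)

    def add_node(self, node_id, **data):
        self.nodes[node_id] = data

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbors(self, source, relationship):
        # .get avoids creating an empty entry for unknown sources.
        return [t for rel, t in self.edges.get(source, []) if rel == relationship]

# Hypothetical sample data.
g = Graph()
g.add_node("dataset:1", title="Example dataset")
g.add_node("agency:a", name="Example agency")
g.add_edge("dataset:1", "published_by", "agency:a")
publishers = g.neighbors("dataset:1", "published_by")
```

Adding a new kind of relationship here is just another edge label, with no schema change, which is the appeal when relations matter more than fixed columns.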
If it's the sets of data (i.e., considering characteristics of the data together) that are important, I'd think long and hard about the key groupings and design the relational database tables that way. If the design needed to change to accommodate new information and sets, then I'd manipulate the table structures to better model it when I learned about the new requirements.
Much data can be indexed using a multidimensional format with (time, space, label) as the key and (attribute set, aggregatable characteristics, data) as the payload. Attributes map to dimensions and can be "rolled up" with the aggregatable characteristics (counts, sums, max/min, avg, stdev, etc.).
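A small sketch of that roll-up idea, assuming the payload is simplified to a single numeric measure per (time, space, label) key. The `roll_up` function, the dimension names, and the sample values are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("time", "space", "label")

# Made-up facts: (time, space, label) -> one numeric measure.
facts = {
    ("2009-01", "US-CA", "sales"): 10.0,
    ("2009-01", "US-NY", "sales"): 4.0,
    ("2009-02", "US-CA", "sales"): 6.0,
}

def roll_up(facts, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for key, value in facts.items():
        group = tuple(key[i] for i in idx)  # project the key onto the kept dimensions
        agg = out[group]
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    return dict(out)

by_space = roll_up(facts, keep=("space",))
by_time_and_space = roll_up(facts, keep=("time", "space"))
```

Averages fall out as `sum / count`. A standard deviation would also need a running sum of squares, which is left out here to keep the sketch short.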
Your question is a little vague without the why, though, and it's the why that's critical to figuring out the design.
A really good example of this type of system is the Content Construction Kit (CCK) module for the Drupal CMS. Combined with the Drupal Views module, it is a great demonstration of not only how to manage a database with a dynamic structure, but also how to make the content accessible to users, which is just as important as storing the data itself.
I was blown away when I realized how capable these two modules are. Drupal and these modules are open source, so you can of course analyze them as much as you need to understand the concepts behind it all.
If you can't determine the exact data model in advance, and also need to handle complex data, I actually think tables are not the best underlying abstraction to use. A graph as the underlying model is a much better fit for these requirements. You could look into graph databases (AllegroGraph, Neo4j, VertexDB) or use RDF (a standardized graph data model that is also supported by AllegroGraph and Neo4j). RDF makes your data less dependent on a specific tool set. Some good starting points:
- Why Semantics? (what's so good with using RDF?)
- Linked Data (usage of RDF for public data)
- "Comics" is hard (overview of data models like key/value and graph)
-- disclaimer: I'm on the Neo4j team
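For readers unfamiliar with the RDF-style model mentioned above, here is a minimal sketch of the underlying idea: every fact is a (subject, predicate, object) triple, and queries are patterns with wildcards. This is plain Python, not a real RDF library, and the `match` helper and `ex:` identifiers are made up for illustration.

```python
# Each statement is one (subject, predicate, object) triple. All identifiers are hypothetical.
triples = {
    ("ex:dataset1", "ex:title", "Example dataset"),
    ("ex:dataset1", "ex:publishedBy", "ex:agencyA"),
    ("ex:dataset2", "ex:publishedBy", "ex:agencyA"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching the pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Everything published by the hypothetical agency.
published = match(triples, p="ex:publishedBy", o="ex:agencyA")
```

New kinds of facts are just new predicates, so the "schema" can grow with the data instead of being fixed up front.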