how to design a database to save data whose structure is not always known during design, and could change later?

views:

140

answers:

+4 Q:

how to design a database to save data whose structure is not always known during design, and could change later?

sites like IBM's many eyes, swivel etc store varieties of data and allow their users to visualize them. How do they design their tables? For example, if you were to save the data from data.gov site into a database and allow your users to perform operations on it, how would you go about designing the tables? The structure needs to be generic enough to hold any type of data. data.gov for example, has tons of data, some of them more complex than the others.

+1 A:

The most simplest way of answering your question is to say use a string-string dictionary. its a popular structure in the NOSQL community and the flexibility python and lua built on it as well. You can specialize it for you domain by adding the dimension of time and such - i.e., hypertable does that.

Any data model can be serialized to a string-string dictionary. I don't know the specifics but MySql has a BDB back-end. BDB's core data structures are string-string.

p.s., i'm half of a relational zealot as well, so if the data is important I'd model it relationally :P

Hassan Syed 2009-12-02 14:54:21

could you please elaborate a little? I am not sure I get it. if you take data.gov as example, the data does not belong to any one particular domain. how will it work in such a scenario?

2009-12-02 15:19:34

You simply take each row of data and give it a unique identifier to find again and then some generic attributes (like time, tags and space) so you can more easily find it. Eg//(id) -> (time, data)(tag) -> Collection(id)(time range) -> Collection(id)The game then becomes searching for ids.Make sense?It's like a dictionary, where you have a word and a definition indexed according to first letter. If you want to know the meaning (the data), you simply need to find the word (the key).

James 2009-12-02 15:43:06

In a hashed dictionary structure you can model anything. if you wanted to serialized your data.gov data you could simply prefix your keys with data_gov_. if you wanted to represent a relational catalog, table and specific row from the data.gov data you would do dictionary[data_gov_cat_tab_pk]. if you were interested in a named field you could do dictionary[data_gov_cat_tab_pk_field], or dictionary[data_gov_cat_tab_pk] could allow key collions in which case your values could be of the type "field_value". You can serialize anything to a dictionary really.

Hassan Syed 2009-12-02 15:45:06

oh ok, interesting. Not used to looking at relational data this way, makes sense now. thank you.

2009-12-02 16:37:42

I can't be much help but this article, How Friendfeed uses MySQL to store schemaless data, might be of some use.

You could also check out document-orientated databases like CouchDB or MongoDB

jammus 2009-12-02 14:54:24

Have you considered using a schema-free database like couchDB?

lorenzog 2009-12-02 14:58:26

schema-free database? not sure what it is, need to google it. How is the data stored in it? key value pairs?

2009-12-02 15:18:16

take a look at the url, it's very well-worded. I would only copy-paste their description :)

lorenzog 2009-12-02 16:17:07

+1 A:

The key question is whether it's the simple retrieval of the data that's important or the aggregating and searching through it.

Ie// What are you using the data FOR?

If it's just data (ie// it's just some random text/binary), I wouldn't bother with a database at all. Just slap it in a series of files, strip it of encoding and use grep / sed / awk / LISP to move through it without any labels. Data is only really useful for search / retrieve operations rather then deep trending.

If it's a single row or element of data (like a Stack Overflow question or comment), I'd consider either the NOSQL patterns (essentially, just lookups) or an OODB.

If it's the relations that are important, I'd model it like a graph, with edges and nodes. Nodes contain data, edges contain relationships. I'd be tempted at that point to implement it manually using disk based pointers.

If it's the sets of data (ie// considering characteristics of the data together) that are important, I'd think long and hard about the key groupings and design the relational database tables that way. If the design needed to change to accommodate new information and sets, then I'd manipulate the table structures to better model it when I learned about the new requirements.

Much data can be indexed using the multidimensional format with (time, space, label) as the key and (attribute set, aggregatable characteristics, data) as the payload. Attributes map to dimensions and can be "rolled up" with the aggregatable characteristics (counts, sums, max/min, avg, stdev, etc...).

Your question is a little vague without the why though, and it's the why that's critical to figuring out the design.

James 2009-12-02 15:39:16

lets take the most complicated scenario - storing, searching and updating. then graphs is the best way to go?

2009-12-02 16:42:06

Well, probably flat files then. If you don't care about sets or aggregates, then why bother with a database at all?

James 2009-12-02 22:32:26

A really good example of this type of system exists within the Drupal CMS module Content Construction Kit which, when integrated with the Drupal Views module is a great demonstration of how to not only manage a database with dynamic structure, but also how to make the content accessible to the users which is just as important as storing the data itself.

I was blown away when I realized how capable these two systems are. Drupal and these modules are open source, so you can of course analyze it as much as you need to understand the concepts behind it all.

Eric Cosky 2009-12-02 15:58:46

If you can't determine the exact data model in advance, and also need to handle complex data, I actually think tables are not the best underlying abstraction to use. A graph as underlying model is a much better fit to these requirements. You could look into graph databases (AllegroGraph, Neo4j, VertexDB) or use RDF (which is a standardized graph data model also supported by AllegroGraph and Neo4j). RDF makes your data less dependent on a specific tool set. Some good starting points:

Why Semantics? (what's so good with using RDF?)
Linked Data (usage of RDF for public data)
"Comics" is hard (overview of data models like key/value and graph)

-- disclaimer: I'm on the Neo4j team

nawroth 2009-12-02 16:27:04

ok, just googled on RDF and came across this site http://www.thisweknow.org/ whats more, they use data.gov catalog :)Thank you for the answer

2009-12-02 16:40:12

neo4j looks very very interesting. thank you.can it be used only with Java?

2009-12-02 17:25:50

You can use it with any language implemented for the JVM. There are wrappers for Ruby, Python, Scala, Groovy/Grails and Clojure at the moment. See: http://wiki.neo4j.org/content/Main_Page#Language_bindings

nawroth 2009-12-03 09:18:57

ansaurus

tags:

views:

answers:

how to design a database to save data whose structure is not always known during design, and could change later?

related questions