I'm working on a star schema for analysis of posted form data. The site that the form data will be posted to is actually external to the site hosting the form, so only the data in the form will be available. I'm going to give the option to include some extra useful information with hidden fields: original referrer, session id, etc.

I'll be able to use regular expressions to match certain data types and pull them out into specific dimensions, e.g. postcodes.
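A minimal sketch of that extraction step, assuming UK-style postcodes and hypothetical field names (`address`, `comment`); the pattern is deliberately simplified and real postcode validation is more involved:

```python
import re

# Hypothetical posted field values
field_values = {
    "address": "12 High Street, Cambridge CB2 1TN",
    "comment": "please call after 5pm",
}

# Simplified UK postcode pattern -- not a full validator
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

def extract_postcodes(values):
    """Scan every field value and collect anything that looks like a postcode."""
    found = []
    for value in values.values():
        found.extend(POSTCODE_RE.findall(value))
    return found
```

Anything matched this way can be written to a postcode dimension, while the raw value stays with the original post.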

I have a solution for dealing with the arbitrary nature of the dimensions; it's not a great one, but it will work.

The problem I have is that I have no idea what is going to be in my fact table; it's not as if there is a nice numerical value that I can aggregate, apart from the bare fact that "yes, there is a form post" satisfying these criteria.

I'm wondering if I'm approaching this in the right way? Am I using the wrong tool for the job? Or am I just missing something?

Simon.

Further detail:

There are two areas of functionality. The first is filtering the form posts dependent on criteria, e.g. between two timestamps, though pretty much anything is up for grabs in terms of filtering. The selected form posts will then be used to generate a CSV file for export.
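The filter-then-export step might look something like this sketch, assuming hypothetical posts with a `submitted_at` timestamp and two example fields:

```python
import csv
import io
from datetime import datetime

# Hypothetical form posts: field name -> value, plus a submission timestamp
posts = [
    {"submitted_at": datetime(2010, 1, 5), "name": "Alice", "postcode": "CB2 1TN"},
    {"submitted_at": datetime(2010, 2, 9), "name": "Bob", "postcode": "SW1A 1AA"},
]

def export_between(posts, start, end):
    """Filter posts to [start, end) and render them as CSV text."""
    selected = [p for p in posts if start <= p["submitted_at"] < end]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["submitted_at", "name", "postcode"])
    writer.writeheader()
    writer.writerows(selected)
    return buf.getvalue()
```

In practice the timestamp filter would be a WHERE clause against the fact table, but the shape of the export is the same.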

The other main area is analytics; studying the conversion of ad spend into customer leads is an obvious starting point. This is also somewhat open-ended and depends on the form data.

+1  A: 

You aren't designing a star schema. You're designing an Entity-Attribute-Value table, which has all the problems you're identifying.

If you really have no idea what your data will look like, i.e. what form fields exist and what data types should be used for each one, then a relational database is not the right tool to persist the information. Try XML or YAML or JSON. Those are structured, but dynamic, formats. You can establish metadata on the fly. You can store the whole form instance in a file or in a BLOB in your database.
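As a sketch of the "store the whole instance" approach, with a hypothetical form post: serialise everything to JSON, persist the string in a file or a BLOB column, and read the metadata back on the fly:

```python
import json

# A hypothetical form post -- field names and types vary per form
form_post = {
    "session_id": "abc123",
    "referrer": "http://example.com/landing",
    "fields": {"email": "a@example.com", "budget": "5000"},
}

# Serialise the whole instance; the string can live in a file or a BLOB column
blob = json.dumps(form_post)

# Later, deserialise and inspect whatever fields happen to be present
restored = json.loads(blob)
```

The schema lives in the data itself, so a new form field never requires a DDL change.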

Another emerging technology that can manage dynamic metadata is RDF, with the query language SPARQL. Sesame is an example of a semantic data engine.

Bill Karwin
Thanks for that, I was thinking of something close to EAV, glad to see I wasn't totally lost. I still need to do analytics so I'm thinking that a combination of star-schema and EAV might work.I'll have to be careful with the meta data though.
Simon Farrow
"A combination of star-schema and EAV" makes no sense.
Bill Karwin
A: 

It's OK to have fact tables with no measurements - they're just called "factless fact tables". But you still typically put a row_count column in there - even though it will always have a value of one - to make it easy to build summary tables. And you may end up adding other measurements later - a measure of the sentiment of the text, for example.
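A minimal illustration of a factless fact table with a row_count column, using an in-memory SQLite database and made-up dimension keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE form_post_fact (
        date_key     INTEGER,
        form_key     INTEGER,
        referrer_key INTEGER,
        row_count    INTEGER DEFAULT 1  -- always 1, but makes SUM() natural
    )
""")
conn.executemany(
    "INSERT INTO form_post_fact (date_key, form_key, referrer_key) VALUES (?, ?, ?)",
    [(20100105, 1, 7), (20100105, 1, 9), (20100106, 2, 7)],
)

# Counting events becomes a plain SUM, which also rolls up into summary tables
posts_per_day = conn.execute(
    "SELECT date_key, SUM(row_count) FROM form_post_fact"
    " GROUP BY date_key ORDER BY date_key"
).fetchall()
```

SUM(row_count) and COUNT(*) give the same answer at the base grain, but SUM keeps working unchanged once rows are pre-aggregated into a summary table.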

And I wouldn't get too worried that this doesn't look like a warehousing 101 example - there are a lot of corner cases where odd things happen. You can certainly have field_name & field_value as columns, or even just field_value if you don't have field_name. That works. And it provides a ton of flexibility.

But you are missing out on some important functionality. Since a given item or object is really split across multiple rows, typical SQL filtering won't work well. You typically need to pull all the rows into a little app that can evaluate them as a whole - or write some very complex multi-step SQL where you insert the boolean result of each row evaluation into a temp table, then group by session_id (or whatever the equivalent is), then finally evaluate the and/or logic.
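The "little app" route can be sketched like this, assuming hypothetical EAV rows keyed by session_id: gather each session's rows into one record, then apply AND logic across the whole record rather than row by row:

```python
from collections import defaultdict

# EAV-style rows: one row per (session_id, field_name, field_value)
rows = [
    ("s1", "country", "UK"),
    ("s1", "budget", "5000"),
    ("s2", "country", "UK"),
    ("s2", "budget", "100"),
]

def sessions_matching(rows, predicates):
    """Regroup rows per session, then evaluate AND logic over the whole set."""
    by_session = defaultdict(dict)
    for session_id, name, value in rows:
        by_session[session_id][name] = value
    return [
        sid for sid, fields in by_session.items()
        if all(pred(fields) for pred in predicates)
    ]

matches = sessions_matching(rows, [
    lambda f: f.get("country") == "UK",
    lambda f: int(f.get("budget", 0)) >= 1000,
])
```

A pure-SQL version needs the temp-table dance described above, because no single EAV row contains both the country and the budget.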

Another option is to go this route, but gradually develop your ETL parsing functionality so that over time you can pull some of this stuff out into more traditional dimensions. Perhaps this becomes your staging or raw table, while you try to have most reports hit your more traditional star schema.

Last option: consider a non-relational database. Something more document-oriented may provide better functionality for you.

KenFar