ansaurus

Question

What is the best practice to represent multi dimensional data in tables?

Answer 1

A:

EAV might be what are you after.

devnull 2010-09-06 10:28:52

Answer 2

+1 A:

If each answer will always have both dimensions to it, then you can have one table with three columns, and those three columns are your three dimensions.

In general, a n-column table is the relational way to represent n distinct dimensions of information.

Erwin Smout 2010-09-06 14:23:30

Answer 3

+4 A:

The example has five dimensions: User, Question, Answer, Survey, and Demographic. The Demographic table is here to freeze the income level and age group at the time of the survey -- the user can change income and age over time. The CurrentDemographicID in the User table is pre-calculated before the survey, and is later copied to the DemographicKey of the UserQA table.

alt text

select 
      AgeGroup
    , IncomeLevel
    , count(1)   as NumberOfPeople
from UserQA      as f
join Question    as q on q.QuestionID    = f.QuestionID
join Answer      as a on a.AnswerID      = f.AnswerID
join Demographic as d on d.DemographicID = f.DemographicID
join Survey      as s on s.SurveyID      = f.SurveyID
where QuestionFullName = 'specific question here'
  and AnswerFullName   = 'specific answer here'
  and SurveyFullName   = 'specific survey here'
group by AgeGroup, IncomeLevel
order by AgeGroup, IncomeLevel ;

Also, take a look at this question/answer.

Damir Sudarevic 2010-09-06 15:57:46

Wow way to go all out.... nice!

Capt Otis 2010-09-06 16:39:20

Answer 4

A:

It is highly probable that whoever voted down devnull's answer has insufficient experience in industrial data modelling or had experience only in simplistic exercises.

As a generalisation, for multiple dimensional analysis to be effective you need two levels data modelling.

A substrate detail capture
An abstract cube

The answer is indeed EAV as the substrate detail capture. People can afford to be lazy and skip the detail substrate design. Every data mining tool I had been forced (by management) to evaluate has been inadequate because the inability to perform time-profiling.

Ultimately, it required us to write our own data-mining application - because the amount of kludge to make the software work, albeit ineffectively, equals to the amount of effort it takes to create one that works effectively. And even works extra-ordinarily spectacular when we fed slices of the abstract cube into SAS/Insight. While the vendors' software took an hour to build the cube, and sometimes 24 hours, we needed our cubes to be built within 5 minutes and frequently within 15 seconds - due to effective substrate data modelling to enable efficient slicing of data to build the cube.

In industrial practice, we might need to flexibly move and range the time window of an analysis to match conditions that occured 6 months ago. Or that, we need to review the performance of processes for the last financial year.

We need to have correct time-phase correlation. Let's say a church has a per night statistic for a particular evangelistic rally as number of people "saved", number of "rededications", number of "contribution$", total contribution$, number of attendance - over 10 nights of the rally.

Then the church has statistics for number of tracts distributed by volunteer and the location each volunteer was assigned, the radio and tv advertisements and google ads placed. So the church statistician would need to collect info from each convert and $contributor - location of residence, when and where a tract was received, when and where an advertisement was encountered. And if they could collect such information from everyone who attended and who told them about it and when and where those who told them about the rally received a tract or encountered an advertisement. With such info at hand, the statistician would then be able a more truthfully matched multi-dimensional cube to let church leaders decide how to effectively organise their future pre-rally out-reach.

Data model optimization may not be possible for churches but for it is very possible for a factory full of equipment, robots, and trained operators. It is extraordinarily helpful to get phase-correlated data to treat an epidemic.

The EAV model is necessary because the number of parameters collectible, and number of attributes, vary from equipment to equipment and from process to process and from disease to disease. We cannot afford to have a table each for each set of parameters. Frequently a process or piece of equipment collects different sets of parameters, depending on the product that is being processed.

Sometimes the number of parameters collected can number a thousand. We cannot have a table with a thousand columns, can we? We even violate data normalization principles to store data in a single row or in a blob because of inefficient data access in a highly normalized table.

Further, we also need to version our data set. Let us say we designed an experiment for 2004. In 2006, we discovered that we needed to include new dimensions and discard some useless dimensions, and so we created a new version of the experiment. Then when we analyse the performance of the experiment between 2002 and 2008, we need to provide proper treatment to the change in the sets of dimensions in the experiment. In biological experiments and social behaviour surveys, the version changes would be more frequent.

I have a relational model of variable-dimensionality for attributes and parameters here: http://appdesign.blessedgeek.com/discrete-flow-resource-management. Not exactly EAV, but gives an idea what industrial multi-dimensional data modelling entails.

Blessed Geek 2010-09-06 23:51:47

ansaurus

tags:

views:

answers:

What is the best practice to represent multi dimensional data in tables?

related questions