In our application, we support user-written plugins.

Those plugins generate data of various types (int, float, str, or datetime), and each datum is labeled with a bunch of metadata (user, current directory, etc.) as well as three free-text fields (MetricName, Var1, Var2).

Now we have several years of this data, and I'm trying to design a schema which allows very fast access to those metrics in an analytical fashion (charts and stuff). This is easy as long as there are only a few metrics we're interested in, but we have a large number of different metrics at different granularities, and we'd like to store user-added data to allow for later analysis (possibly after a schema change).

Example data: (please keep in mind this is very simplified)

======================================================================================================
| BaseDir         | User   | TrialNo | Project | ... | MetricValue | MetricName  | Var1  | Var2      |
======================================================================================================
| /path/to/me     | me     | 0       | domino  | ... | 20          | Errors      | core  | dumb      |
| /path/to/me     | me     | 0       | domino  | ... | 98.6        | Temperature | body  |           |
| /some/other/pwd | oneguy | 223     | farq    | ... | 443         | ManMonths   | waste | Mythical  |
| /some/other/pwd | oneguy | 224     | farq    | ... | 0           | Albedo      | nose  | PolarBear |
| /path/to/me     | me     | 0       | domino  | ... | 70.2        | Temperature | room  |           |
| /path/to/me2    | me     | 2       | domino  | ... | 2020        | Errors      | misc  | filtered  |
Anyone can add a parser plugin to start measuring an AirSpeed metric, and we'd like our analysis tools to "just work" on that new metric.


Update:

Considering that many of the MetricNames are well known beforehand, I can satisfy my requirements if I enable analysis on those known metrics and simply store the remaining user-added ones. We can accept that new metrics won't be available for heavy-duty analysis without an edit to the schema.

What do you guys think of this solution?

I've divided our metrics into three fact tables: one for facts that don't need a MetricTopic, one for facts that do, and one for all the other metrics, including unexpected ones.

Metrics Schema #3
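(The linked schema image isn't reproduced here, so the following is only a hypothetical sketch of the three fact tables described above; all table and column names are invented for illustration, and the dimension tables the keys point at are omitted.)

    -- Hypothetical sketch of the three fact tables described above.
    -- Dimension keys (TrialKey, MetricKey, TopicKey) are assumed to reference
    -- separate dimension tables, omitted here for brevity.
    CREATE TABLE FactPlain (          -- facts that don't need a MetricTopic
        TrialKey  INTEGER,            -- key into a trial/run dimension
        MetricKey INTEGER,            -- key into a metric dimension
        Value     DOUBLE PRECISION
    );

    CREATE TABLE FactWithTopic (      -- facts that do need a MetricTopic
        TrialKey  INTEGER,
        MetricKey INTEGER,
        TopicKey  INTEGER,            -- key into a MetricTopic dimension
        Value     DOUBLE PRECISION
    );

    CREATE TABLE FactOther (          -- all other metrics, incl. unexpected ones
        TrialKey   INTEGER,
        MetricName VARCHAR(64),       -- free text, as in the original rows
        Var1       VARCHAR(64),
        Var2       VARCHAR(64),
        Value      VARCHAR(64)        -- stored, but not used for heavy analysis
    );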


For the bounty:

I'll accept any critique which shows how to make this system more functional, or brings it into closer alignment with industry best practices. References to the literature give added weight.

A: 

I could add another column for every metric we care about, but that could range into the hundreds or even thousands. I'd have to write a script just to update the schema, and that smells like bad design.

You don't have that many facts. There aren't that many units.

Facts have units. Seconds, pounds, bytes, dollars.

You need to review the "Star Schema" design. You have dimensions (probably a lot) and measurable facts (probably very few).

You have a join between facts and all of the associated dimensions. You can do sum, count on the facts, and group-by on the dimensions.

You can't have thousands of independent facts. That's almost impossible. But you can have thousands of combinations of dimensions; that's common.

Separate facts (measurable quantities that add pleasantly) from dimensions (definitional qualities) and you should have a lot of dimensions around a few facts.
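As a concrete illustration of that split, an analytical query against a star schema joins one fact table to its dimensions, aggregates the facts, and groups by the dimension attributes. The table and column names below are hypothetical, not taken from the question:

    -- Hypothetical star-schema query: SUM/COUNT over facts,
    -- GROUP BY over dimension attributes.
    SELECT d_user.UserName,
           d_metric.MetricName,
           SUM(f.Value) AS total_value,
           COUNT(*)     AS n_rows
    FROM   FactMeasurement f
    JOIN   DimUser   d_user   ON d_user.UserKey     = f.UserKey
    JOIN   DimMetric d_metric ON d_metric.MetricKey = f.MetricKey
    GROUP BY d_user.UserName, d_metric.MetricName;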

Buy a copy of Kimball.

S.Lott
I actually bought Kimball on Kindle just after posting this question.
bukzor
@S.Lott: I agree. If I look closely, I only have maybe 40 different facts, but I still have the problem that someone can add a new type of fact without warning, requiring an update to the schema. Is there any common wisdom for a system with indeterminate facts? Make a dimension called FactName?
bukzor
+4  A: 

If I understand correctly, you are looking for a schema to support on-the-fly creation of measures in a DW. In a classical data warehouse each measure is a column, so in a Kimball star you would need to add a column for each new measure -- change the schema.

What you have is an EAV model, and analytics on EAV is not easy and not fast -- take a look at this discussion.
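To make that concrete: with an EAV layout, putting even two metrics side by side means one conditional aggregation (or self-join) per metric, so every new metric changes the queries rather than the schema. A hypothetical example against the wide table sketched earlier in the question:

    -- EAV-style pivot: one MAX(CASE ...) per metric of interest.
    -- Each newly invented metric requires editing queries like this,
    -- which is what makes ad-hoc analytics over EAV data slow and awkward.
    SELECT TrialNo,
           MAX(CASE WHEN MetricName = 'Errors'      THEN MetricValue END) AS Errors,
           MAX(CASE WHEN MetricName = 'Temperature' THEN MetricValue END) AS Temperature
    FROM   MetricRecord
    GROUP BY TrialNo;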

I would suggest you look at tools like Splunk, which are suited to this type of problem.

Damir Sudarevic
@Damir: Thanks! At least I have a name for my problem now. Do you know of any authoritative writing on analytics for EAV?
bukzor
@bukzor, no not really.
Damir Sudarevic