views: 155

answers: 4

Hi,

I am developing an application where I have to deal with an entity named 'Skill'. The thing is that a 'Skill A' can have a certain relevancy to a 'Skill B' (the relevancy is used for search purposes). Similarly, 'Skill B' can be relevant to 'Skill C'. We currently have the following data model to represent this scenario:

Skill {SkillId, SkillName}

RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel}

Now given the above scenario, we have an implicit relation between 'Skill A' and 'Skill C'. What would the optimal data model for this scenario be? We would also have to traverse this hierarchy when performing a search.
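
To illustrate with made-up data (and assuming, just for the example, that levels live on a 0-to-1 scale and combine multiplicatively along a path), the traversal we have in mind looks roughly like this:

    # Made-up in-memory copy of the RelevantSkill table:
    # (SkillId, RelevantSkillId, RelevanceLevel), levels assumed in [0, 1].
    relevant_skill = [
        ('A', 'B', 0.8),
        ('B', 'C', 0.5),
    ]

    def implicit_relevance(start):
        """Walk outward from `start`, deriving implicit levels by
        multiplying levels along each path (an assumed rule)."""
        levels = {start: 1.0}
        frontier = [start]
        while frontier:
            current = frontier.pop()
            for src, dst, level in relevant_skill:
                if src == current:
                    derived = levels[current] * level
                    if derived > levels.get(dst, 0.0):
                        levels[dst] = derived
                        frontier.append(dst)
        del levels[start]
        return levels

    print(implicit_relevance('A'))  # {'B': 0.8, 'C': 0.4}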

+1  A: 

Your best bet is to:

  1. augment RelevantSkill with an ImplicitRelevance boolean column:

    RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel, ImplicitRelevance}

  2. insert (into the RelevantSkill table) rows corresponding to all implicit (indirect) relevance relationships (e.g. "Skill A" -> "Skill C") with their corresponding computed RelevanceLevel values, if and only if the computed RelevanceLevel is above a set threshold. These rows should have ImplicitRelevance set to true:

    skill_a_id, skill_c_id, computed_level, 'T'

If any changes are made to the explicit relevance levels (metrics), remove all rows with ImplicitRelevance=true and recompute (re-insert) them.
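
A minimal sketch of step 2 in Python (the 0-to-1 scale, the multiplicative combination rule, and the threshold value are all assumptions; substitute whatever metric you actually use):

    THRESHOLD = 0.25  # assumed cut-off

    # Explicit rows from RelevantSkill, keyed by (SkillId, RelevantSkillId);
    # levels are assumed to be on a 0-to-1 scale.
    explicit = {
        ('A', 'B'): 0.8,
        ('B', 'C'): 0.5,
        ('C', 'D'): 0.9,
    }

    def compute_implicit_rows(explicit, threshold):
        # Floyd-Warshall-style closure: keep the strongest derived level
        # for every reachable pair, then drop pairs below the threshold.
        best = dict(explicit)
        skills = {s for pair in explicit for s in pair}
        for k in skills:
            for i in skills:
                for j in skills:
                    via = best.get((i, k), 0.0) * best.get((k, j), 0.0)
                    if via > best.get((i, j), 0.0):
                        best[(i, j)] = via
        return [
            (src, dst, level, 'T')
            for (src, dst), level in best.items()
            if (src, dst) not in explicit and src != dst and level >= threshold
        ]

    # Rows to (re-)insert after deleting all ImplicitRelevance = 'T' rows,
    # e.g. ('A', 'C', 0.4, 'T') for the implicit A -> C relationship.
    print(compute_implicit_rows(explicit, THRESHOLD))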

Cheers, V.

vladr
+1  A: 

Something left open by your explanation is how the relevance levels are combined in the case of the indirect ("implicit") relationships. E.g. if skill A is relevant to B with level 3 and skill B is relevant to skill C with level 5, what is the level (as a number) of the indirect relevance of skill A to skill C?
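
For instance, if the levels were normalized to a 0-to-1 scale and combined multiplicatively (one possible rule, not implied by the question), a 0.3 relevance from A to B and a 0.5 relevance from B to C would give an implicit A-to-C level of 0.3 × 0.5 = 0.15. With raw levels like 3 and 5 it is much less obvious (min(3, 5)? something else?), and the model cannot really be finished until this rule is pinned down.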

The proper data model depends on two things: how many skills you have, and how dense the relationship structure is (dense = lots of skills are relevant to others). If the relationship structure is dense and you have few skills (< 1000), you may be best off representing the whole thing as a matrix.
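
A minimal sketch of the matrix layout (Python with NumPy; the names and the 0-to-1 weights are illustrative):

    import numpy as np

    # Hypothetical dense layout: relevance[i, j] holds the relevance
    # level of skill i to skill j (0.0 = not related).
    skill_ids = ['A', 'B', 'C']  # maps matrix index -> SkillId
    index = {s: i for i, s in enumerate(skill_ids)}

    relevance = np.zeros((len(skill_ids), len(skill_ids)))
    relevance[index['A'], index['B']] = 0.8
    relevance[index['B'], index['C']] = 0.5

    # Lookups are O(1), and whole-graph computations (e.g. deriving the
    # indirect relevances) become plain matrix operations.
    print(relevance[index['A'], index['B']])  # 0.8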

If, on the other hand, you have many skills but a sparse relationship structure, you can represent it as three tables:

Skill {SkillId, SkillName}

RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel}

IndirectRelevance { SkillId, RelevantSkillId, RelevanceLevel}

The third table (IndirectRelevance) is calculated from the two primary tables; whenever you change the Skill or RelevantSkill tables, you need to update the IndirectRelevance table.

I think it is better to have three tables than two; this makes the implementation cleaner and more straightforward. RelevantSkill contains the explicitly stated relationships; IndirectRelevance contains all the derived facts.
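
At search time the two relevance tables can then be queried together, with no graph traversal needed; a sketch using sqlite3 (the exact schema and key types are assumptions):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE Skill (SkillId INTEGER PRIMARY KEY, SkillName TEXT);
        CREATE TABLE RelevantSkill (
            SkillId INTEGER, RelevantSkillId INTEGER, RelevanceLevel REAL);
        CREATE TABLE IndirectRelevance (
            SkillId INTEGER, RelevantSkillId INTEGER, RelevanceLevel REAL);
    """)
    conn.execute("INSERT INTO RelevantSkill VALUES (1, 2, 0.8)")
    conn.execute("INSERT INTO IndirectRelevance VALUES (1, 3, 0.4)")

    # All skills relevant to skill 1, explicit or derived, strongest
    # first -- no graph traversal needed at query time.
    rows = conn.execute("""
        SELECT RelevantSkillId, RelevanceLevel FROM RelevantSkill
         WHERE SkillId = ?
        UNION ALL
        SELECT RelevantSkillId, RelevanceLevel FROM IndirectRelevance
         WHERE SkillId = ?
        ORDER BY RelevanceLevel DESC
    """, (1, 1)).fetchall()
    print(rows)  # [(2, 0.8), (3, 0.4)]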

antti.huima
A: 

There are some factors to consider before you can choose the best option:

  • how many skills are there?
  • are relations sparse or dense (i.e. are skills related to a lot of other skills)?
  • how often do they change?
  • is there a relevancy threshold (minimal relevancy that is of interest to you)?
  • how is multi-path relevancy calculated?

The structure will obviously be like the one antti.huima proposes. The difference is in how IndirectRelevance is implemented. If there are a lot of changes, a lot of relations, and the relations are dense, the best way might be a stored procedure (perhaps accessed through a view). If the relations are sparse and there is a threshold, the best option might be a materialized view or a table updated via triggers.
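
For the trigger option, a minimal sqlite3 sketch (the wipe-and-recompute invalidation is an assumption; real triggers would also cover UPDATE and DELETE, and could be more surgical):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE RelevantSkill (
            SkillId INTEGER, RelevantSkillId INTEGER, RelevanceLevel REAL);
        CREATE TABLE IndirectRelevance (
            SkillId INTEGER, RelevantSkillId INTEGER, RelevanceLevel REAL);

        -- Any change to the explicit relations invalidates the derived
        -- facts; here they are simply cleared so that the application
        -- can recompute them (crude, but it shows the mechanism).
        CREATE TRIGGER relevance_changed
        AFTER INSERT ON RelevantSkill
        BEGIN
            DELETE FROM IndirectRelevance;
        END;
    """)

    conn.execute("INSERT INTO IndirectRelevance VALUES (1, 3, 0.4)")
    conn.execute("INSERT INTO RelevantSkill VALUES (1, 2, 0.8)")  # fires trigger
    print(conn.execute("SELECT COUNT(*) FROM IndirectRelevance").fetchone())  # (0,)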

vartec
+1  A: 

What you're asking for seems to be basically a graph distance algorithm (slash data structure) computed from a set of pairwise distances. A reasonable (and nicely computable) metric is commute time.

It can be thought of thus: construct a graph where each node is a Skill, and each edge represents the relevancy of the nodes it connects to each other. Now imagine that you're starting at some node in the graph (some Skill) and randomly jumping to other nodes along the defined edges. Let's say that the probability of jumping from Skill A to Skill B is proportional to the relevancy of those skills to each other (normalized by their relevancy to all other skills ...). The commute time is then the average number of steps a random walk takes to get from Skill A to Skill C and back again.

This has the very nice property that adding more paths between two nodes makes the commute time shorter: if Skill A and B, B and C, C and D, and D and A are related, then the commute time between A and C will get shorter still. Moreover, the commute time can be computed quite easily from an eigenvalue decomposition of the Laplacian of your sparsely connected Skill graph (I think the reference I gave you shows this, but if not there are many available).

If you want to actually store the commute time between every pair of Skills, you'll need a fully connected graph, i.e. an NxN matrix (where N is the number of Skills). A far nicer variant, however, is, as stated above, to drop all connections weaker than some threshold and then store the resulting sparsely connected graph as rows in a database.
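
To make that concrete, a short sketch of the computation (Python with NumPy, using the standard commute-time formula via the pseudoinverse of the graph Laplacian; the weights are made up, and the graph is assumed to be connected):

    import numpy as np

    # Symmetric weighted adjacency matrix of the Skill graph
    # (made-up weights; W[i, j] = relevance of skill i to skill j).
    W = np.array([[0.0, 0.8, 0.0],
                  [0.8, 0.0, 0.5],
                  [0.0, 0.5, 0.0]])

    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W    # graph Laplacian
    L_pinv = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse
    volume = degrees.sum()

    def commute_time(i, j):
        # Expected number of steps for a random walk to go from node i
        # to node j and back: vol(G) * (L+_ii + L+_jj - 2 * L+_ij).
        return volume * (L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j])

    print(commute_time(0, 2))  # commute time between Skill A and Skill C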

Good luck, and I hope this helped!

alex