views:

183

answers:

3

I would like information on algorithms that can help identify commonality and differences between sets of overlapping data.

Using stackoverflow's tag system as an example:

Let's say this question has been given 5 tags. Let's say there are 1000 other questions that have at least one of these tags. Of these 1000 questions, how many of these questions have tags in common that my original post does not have?

Another more simple way of describing this is an auto-suggest tagging system :

"You tagged your question with [5 tags I selected]. Other similiar questions were tagged with [list of tags that might be of interest]. where [list of tags that might be of interest] are frequently occuring tags that aren't in my orginal list.

Code examples in c# if possible :)

+1  A: 

Look into Wager-Hamming distance. This is the Hamming distance defined on strings as the number of edit operations it takes to transform one string into another.

You could also potentially use the partial order of equivalence classes and set inclusion: when questions A and B have the exact same set of tags up to reordering, they're equal, set union, set difference, and set intersection then define a partial order for < and > comparisons.

Charlie Martin
A: 

Thanks for that, that's opened up a whole new world (which i'm going to ignore for the moment as it will make my problem even more complex ;)

In my current understanding of my problem, a tag either equals a tag or it doesn't; there are no partial matches. Hamming distances seem to be a measurement of similarity between 2 entities. I may have misunderstood this, so please advise if I have.

I am currently thinking of my problem in terms of sets (which may be missleading). My question is in 5 sets. It is in a set of questions tagged with 'tag 1'. Lets call this set S1; all questions in this set have been tagged with 'tag 1'. My question is also in a set of questions tagged with 'tag 2' which lets call S2. So on and forth up to S5.

Now, for all of the other questions in the sets S1, S2, S3, S4 and S5, what sets are they additionally a member of?

EG:

Question 2 is a member of S1 and is also a member of S6, S9 and S27.
Question 3 is a member of S4 and is also a member of S6, S12, S27.

Given these two other questions and their set memebership: I can rate S6 with a score of 2 (Question 2 and 3 are a member of S6) I can rate S27 with a score of 2 (Question 2 and 3 are a member of S27) I can rate S9 and S12 with a score of 1 (q2 is in S9, q3 is a member of s12)

I expect my algorithm to to prompt that given that I coded my question with S1, S2, S3, S4 and S5 I might be interested in also coding S6 and S27.

Noel Kennedy
A: 

I don't know of any specific algorithms or data structures, but I could suggest a basic way of handling this:

Assumption: each entry has five unique tags.

  • Collect all entries containing any one of the five tags (no duplicates).
  • For each entry in the list, use an associative array (hashtable) for each tag, incrementing the value.
  • For each entry in the array, append the tag name into that array's entry index.

In (sloppy) pseudo code, use two loops (if possible):

for each entry
    if any tag in original_tags
        tag_list[tag]++
end

for next in tag_list
    tag_count[tag_list[next]] += next
end

This should produce a sparse array of concatenated tag names (ok, I didn't include a separator, but hey it's pseudo code :-). Keep the highest number, then iterate backward for the best suggestions.

(Cache to optimize, but watch for updates)

Paul.

Paul W Homer