views:

59

answers:

1

We have a set of Documents , each has a set of Features. Given feature A, we need to know what is the probability of having feature B in the same document.

I thought of building a probability matrix , s.t: M(i,j) = Probability of having feature B in a document , given that feature A is there.

However , we have an additional requirement: Given feature A is in the document , what are all the features that have a probability > P of being in the same document.

In the mean while all I could think off is a sparse matrix for the Probability matrix , and after it's computed , for each feature run over all the column , sort it by P , and keep it in a linked list somewhere. (So now , we have for each feature , a list of corresponding features

This space complexity is quite big (worst case: N^2, and N is large!) , and the time complexity for each search is O(N).

Any better idea?

+1  A: 

If the number of features is comparable with the number of documents, or greater, consider holding an inverted index: for each feature hold (e.g. a sorted list of) the documents in which it is present. You can then work out the probability of B given A by running a merge on the sorted lists for features A and B.

For the "common features expected given A" question, I can think of nothing better than pre-computing the answer for each A and hoping that the resulting list of features isn't too long.

mcdowella