First, build a mapping from document ID to set of words -- your matrix of 0s and 1s is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus, presumably, its first item) and the "row headings" (docids) are the first items of each row (skipping the first, header row). Then (assuming Python 2.6 or better):
def makemap(matrix):
    im = iter(matrix)
    words = next(im)[1:]        # header row, minus the leading 'doc' label
    themap = {}
    for row in im:
        mapent = set()
        docid = row[0]
        for w, bit in zip(words, row[1:]):
            try:
                if bit: mapent.add(w)
            except:
                # debugging aid: show which word was paired with bad data
                print 'w is %r' % (w,)
                raise
        themap[docid] = mapent
    return themap
Now you need to check all feasible subsets of documents -- the total number of subsets is huge, so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will of course perform no pruning at all.
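Just to make the contrast concrete, the unpruned brute force I'm advising against would look something like this minimal sketch (bruteforce is a name I'm making up for illustration; it takes the docid-to-wordset mapping that makemap above produces):

import itertools

def bruteforce(mapping):
    # Naive baseline: check every subset of 2+ docids independently -- no
    # pruning at all, so the work grows like 2**N however sparse the data is.
    docids = sorted(mapping)
    result = {}
    for r in range(2, len(docids) + 1):
        for ds in itertools.combinations(docids, r):
            commonw = set.intersection(*(mapping[d] for d in ds))
            if len(commonw) >= 2:
                result[ds] = commonw
    return result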
I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this, of course) and make the first batch of "feasible 2-length subsets" (those pairs which have 2+ words in common). That can go in another mapping with tuples or frozensets of docids as the keys.
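In isolation, that pairs-only first step might look something like this sketch (feasible_pairs is just an illustrative name; the full function further down folds this step in directly):

import itertools

def feasible_pairs(mapping):
    # All docid pairs sharing at least 2 words, keyed by sorted docid tuple.
    pairs = {}
    for d1, d2 in itertools.combinations(sorted(mapping), 2):
        commonw = mapping[d1] & mapping[d2]
        if len(commonw) >= 2:
            pairs[d1, d2] = commonw
    return pairs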
Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking, of course, that the total intersection still has 2+ words). This at least does some pruning rather than blindly trying all the 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).
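To get a feel for how fast that blows up, here's a quick back-of-the-envelope check for a modest 20 documents (just 2**20 - 20 - 1):

N = 20
print 2**N - N - 1   # 1048555 candidate subsets for only 20 documents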
It might perhaps be possible to do even better by recording all docids that proved unable to extend a certain N-length subset -- no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
There may be further tweaks for yet more pruning, but offhand none comes immediately to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal, given that those sequences are all O(N) vs the exponential size of the computed structures, though of course worth trying in the tuning/optimizing phase.)
def allfeasiblesubsets(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    # dont_bother[d]: docids "incompatible" with d (fewer than 2 words in common)
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            for d, w in mapping.items():
                # extend only with docids above the current maximum (keeps keys
                # sorted and duplicate-free) and not known to be incompatible
                if d <= md or any(d in dont_bother[d1] for d1 in ds):
                    continue
                commonw = w.intersection(ws)
                if len(commonw) >= 2:
                    feasible_Np1[ds+(d,)] = commonw
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]
You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (fewer than 2 words in common) between one docid and others. This may help if there are several pairs of such "incompatible" docids, and is simple and reasonably unobtrusive, but it is not as powerful as the harder "full" alternative (sketched right below). I'm also keeping all keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.
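If you do want to try that harder "full" alternative, here's an untested sketch of how it might go (allfeasiblesubsets_fullprune is just an illustrative name; it reuses makemap from above, and the key idea is that a docid which fails to extend a subset also fails to extend every superset of it, since intersections can only shrink):

import itertools

def allfeasiblesubsets_fullprune(matrix):
    # Stronger pruning sketch: each feasible subset remembers which docids are
    # known not to extend it, and hands that knowledge down to its extensions.
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    # each feasible pair starts out knowing its members' pairwise incompatibles
    failed = dict((ds, dont_bother[ds[0]] | dont_bother[ds[1]])
                  for ds in feasible_len2)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            fails = set(failed[ds])
            extensions = {}
            for d in docids:
                if d <= md or d in fails:
                    continue
                commonw = mapping[d].intersection(ws)
                if len(commonw) >= 2:
                    extensions[d] = commonw
                else:
                    fails.add(d)    # d can't extend ds, nor any superset of ds
            for d, commonw in extensions.items():
                feasible_Np1[ds + (d,)] = commonw
                failed[ds + (d,)] = set(fails) | dont_bother[d]
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]

It should produce exactly the same feasible subsets as allfeasiblesubsets above, only skipping work that's provably futile; whether the extra bookkeeping pays off depends on your data, so it's worth benchmarking both.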
Here's the small example I've tried, in the same file as the functions above (after, of course, the needed import itertools):
mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]
# (note: the header row has 11 words but each data row only 10 bits, so zip in
#  makemap simply ignores the last column, 'zampino' -- harmless here)

mm = makemap(mat)
print mm

afs = allfeasiblesubsets(mat)
print afs
The results, which appear OK, are:
{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]
but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for increasing lengths, each dict keyed by the ordered tuple form of a docid set, with the set of those docs' common words as the value) can easily be post-processed into any other form you might prefer, such as nested lists.
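For instance, a minimal sketch of such post-processing into nested lists (as_nested_lists is just an illustrative name) could be:

def as_nested_lists(all_feasible):
    # Flatten the list-of-dicts result into [[docid, ...], [word, ...]] pairs.
    out = []
    for feasible in all_feasible:
        for ds, ws in sorted(feasible.items()):
            out.append([list(ds), sorted(ws)])
    return out

print as_nested_lists(afs)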
(Not that it matters, but the text I'm using in the example is an old Italian proverb -- roughly, "the cat goes so often to the lard that she leaves her paw in it";-).