Say I have a matrix (a list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.

What I'd like to know is how to determine all the possible combinations of two or more documents that share two or more words. The result might look something like:

[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']], 
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ...

and so on for each possible combination. I'd love a solution that provides the complete set of combinations, and also one that yields only the combinations that are not a subset of another; so, from the example above, not [[Docid_3, Docid_5], ['word1', 'word17']], since it's a complete subset of the first entry.

I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.

Thanks.

A: 

How many documents? How many unique words? How much RAM do you have?

What do you want to produce in the following scenario: document A has words 1, 2, 3; B has 1, 2; C has 2, 3?

John Machin
Something like: [[a, b], [1, 2]], [[a, c], [2, 3]]. Don't care about the single doc or single words. Would like the solution to handle 1000+ docs, and could live with 100 words or less, averaging probably 30 unique words/doc. RAM -> whatever it takes... the less the better. Can provide a min of 1-2GB, no problemo.
ssweens
+3  A: 
  1. Normalize the text. You only want strings made of string.lowercase. Split/strip on everything else.
  2. Make sets out of this.
  3. Use something like this to get all possible groupings of all sizes:

    import itertools

    def get_all_lengths_combinations_of(elements):
        # yield every grouping of 2 or more elements, of every size
        for no_of_items in range(2, len(elements) + 1):
            for items in itertools.combinations(elements, no_of_items):
                yield items
    

    I'm sure the real itertools wizards will come up with something better, possibly involving izip().

  4. Remember you should be able to use the set.intersection() method like this (a sketch combining these steps follows the list):

    set.intersection(*list_of_sets_to_intersect)
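Putting steps 2-4 together, a minimal sketch might look like the following (it reuses the generator from step 3; the names `doc_words` and `common_word_groups` are made up for illustration, and `doc_words` is assumed to be a dict mapping each docid to its set of normalized words):

    def common_word_groups(doc_words):
        # doc_words: dict mapping docid -> set of normalized words (steps 1-2)
        for docids in get_all_lengths_combinations_of(sorted(doc_words)):
            # words shared by every document in this grouping (step 4)
            common = set.intersection(*[doc_words[d] for d in docids])
            if len(common) >= 2:  # keep only groups sharing two or more words
                yield list(docids), sorted(common)

Note this still enumerates every grouping of docids, which, as the comments below point out, blows up quickly once the document count grows.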
    
badp
...duh :D Thank you so much @Sebastian, I was fitting so well in my complicator's gloves :D
badp
Thanks bp. Started to land on a similar solution while waiting for answers, but this is nice and clean. I found that this starts to get unruly fairly fast after doc counts get past 10's and word matches get past 2 or 3 unfortunately and just crushes my poor little VM.
ssweens
@ssweens: Well, to name a number `len(list(get_all_lengths_combinations_of(range(10))))` is 1013. What you can do to help yourself is to use those results to save on computation. For example, once you calculate X = A ∩ B, you can calculate A ∩ B ∩ C as X ∩ C.
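For instance, with made-up sets:

    A = set(['lo', 'ci', 'lardo', 'tanto'])
    B = set(['lo', 'ci', 'tanto'])
    C = set(['lo', 'ci'])
    X = A & B      # computed once for the pair (A, B)
    ABC = X & C    # A ∩ B ∩ C reuses X instead of intersecting all three again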
badp
Thanks bp. Appreciate the assist.
ssweens
+3  A: 

First, build a mapping from document ID to set of words -- your matrix of 0 and 1 is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus presumably the first item) and the "row headings" (docids) are the first items of each row (minus presumably the first row). Then (assuming Python 2.6 or better):

def makemap(matrix):
  im = iter(matrix)
  words = next(im)[1:]
  themap = {}
  for row in im:
    mapent = set()
    docid = row[0]
    for w, bit in zip(words, row[1:]):
      try:
        if bit: mapent.add(w)
      except:
        print 'w is %r' % (w,)
        raise
    themap[docid] = mapent
  return themap

Now you need to check all feasible subsets of documents -- the total number of subsets is huge so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning of course.

I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this of course) and make the first batch (those pairs which have 2+ words in commons) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.

Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking the total intersection is still 2+ long of course). This at least does some pruning rather than blindly trying all the 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).

It might perhaps be possible to do even better by recording all docids that proved unable to extend a certain N-length subset -- no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.

There may be further tweaks yet you might do for further pruning but offhand none immediately comes to mind, so that's where I'd start from. (For readability, I'm not bothering below with using microoptimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal given those sequences are all O(N) vs the exponential size of computed structures, although of course worth trying in the tuning/optimizing phase).

def allfeasiblesubsets(matrix):
  mapping = makemap(matrix)
  docids = sorted(mapping.keys())
  feasible_len2 = {}
  dont_bother = dict((d, set([d])) for d in docids)
  for d1, d2 in itertools.combinations(docids, 2):
    commonw = mapping[d1].intersection(mapping[d2])
    if len(commonw) >= 2:
      feasible_len2[d1, d2] = commonw
    else:
      dont_bother[d1].add(d2)
      dont_bother[d2].add(d1)
  all_feasible = [feasible_len2]
  while all_feasible[-1]:
    feasible_Np1 = {}
    for ds, ws in all_feasible[-1].items():  
      md = max(ds)        
      for d, w in mapping.items():
        if d <= md or any(d in dont_bother[d1] for d1 in ds):
          continue
        commonw = w.intersection(ws)
        if len(commonw) >= 2:
          feasible_Np1[ds+(d,)] = commonw
    all_feasible.append(feasible_Np1)
  return all_feasible[:-1]

You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others -- this may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also trying to keep all keys in the feasible* dicts as sorted tuples of docids (as the itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.
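For illustration, here's a rough, untested sketch of how that harder "full" pruning might look (the function name allfeasiblesubsets_fullprune and the failed dict are made up for this sketch; everything else mirrors the code above). Each feasible subset remembers the docids that failed to extend it, and its extensions inherit that set -- valid because adding a document can only shrink the common-words intersection:

def allfeasiblesubsets_fullprune(matrix):
  mapping = makemap(matrix)
  docids = sorted(mapping.keys())
  feasible_len2 = {}
  failed = {}   # subset-tuple -> docids already known not to extend it
  for d1, d2 in itertools.combinations(docids, 2):
    commonw = mapping[d1].intersection(mapping[d2])
    if len(commonw) >= 2:
      feasible_len2[d1, d2] = commonw
      failed[d1, d2] = set()
  all_feasible = [feasible_len2]
  while all_feasible[-1]:
    feasible_Np1 = {}
    for ds, ws in all_feasible[-1].items():
      md = max(ds)
      bad = failed[ds]
      for d in docids:
        if d <= md or d in bad:
          continue
        commonw = mapping[d].intersection(ws)
        if len(commonw) >= 2:
          newds = ds + (d,)
          feasible_Np1[newds] = commonw
          failed[newds] = set(bad)  # extensions inherit known failures
        else:
          bad.add(d)  # d can never extend ds or any superset of ds
    all_feasible.append(feasible_Np1)
  return all_feasible[:-1]

The trade-off is memory: failed keeps one set per feasible subset alive for the whole run.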

Here's the small example I've tried (in the same file as these functions after, of course, the import for itertools and collections):

mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs

The results, which appear OK, are:

{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]

but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
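For example, a small helper along these lines (the name as_nested_lists is just for illustration) could flatten the returned list of dicts into the nested-list shape from the question:

def as_nested_lists(all_feasible):
  # all_feasible: list of dicts, each mapping a sorted docid tuple
  # to the set of words those documents have in common
  result = []
  for feasible in all_feasible:
    for docids, words in feasible.items():
      result.append([list(docids), sorted(words)])
  return result

On the example above, as_nested_lists(afs) would include entries such as [['due', 'tre'], ['ci', 'lo', 'tanto']].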

(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).

Alex Martelli
This is really good. Thank you, Alex. Will definitely find use for at least some of this. I'm still tinkering with it, as it seems to fall down on my 1GB VM once I throw 100+ documents (100 words or less) at it. Rough profiling shows it getting overwhelmed in the "while all_feasible[-1]:" loop, namely the "for d, w in mapping.items():" segment, after 5 or 6 docs of commonality. Will try some additional pruning work to speed it up, or tinker with a bitarray solution as Denis suggests. Am thinking bitwise has a better chance of getting perf to a 1000+ document scale.
ssweens
+1  A: 
Denis
I'm following you, but get lost at the hit22, hit23... I think I skipped binary matrix math day in school... I get the concept and why 11 11, etc. but can't envision how one would take a bunch of bitarrays (one per doc) and implement a def match(hit22): function against those arrays...
ssweens
BTW - if you are truly from DE, nice thrashing of Australia today.
ssweens
@ssweens, forget the hit22 stuff for now, first get your feet wet. I added docs-words.py just to show the use of bitarray. BUT a semi-real doc classifier would take a few Klines, better to look around. Is your goal to learn the principles? (Yes, I live in Germany but I'm Canadian. You strine?)
Denis
And no, not Aussie.. US.. envious of their capacity for Foster's, but glad we have our football team. Hoping we don't end up playing Germany in the round of 16 after seeing that shellacking.
ssweens