views:

110

answers:

4

I want to loop through a database of documents and calculate a pairwise comparison score.

A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.

Is there a name for the algorithm for doing this task efficiently? Is there a name for this approach?

Thanks.

A: 

You can keep track of which documents you have already compared, e.g. (with numbers ;))

compared = set()

for i in [1,2,3]:
    for j in [1,2,3]:
        pair =  frozenset((i,j))
        if i != k and pair not in compared:
            compare.add(pair)
            compare(i,j)

Another idea would be to create the combination of documents first and iterate over them. But in order to generate this, you have to iterate over both lists and the you iterate over the result list again so I don't think that it has any advantage.

Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:

docs = [1,2,3]
l = len(docs)
for i in range(l):
    for j in range(i+1,l):
        compare(l[i],l[j])
Felix Kling
My solution is much faster. No overhead of the pair stuff.
Hogan
Doesn't this implementation turn the complexity from an n^2 to an n^3? Not only do you have the nested loop, but the "pair not in compared" is an O(n) operation. There are easier/more efficient implementations. But it seems like @seanieb is satisfied.
Traveling Tech Guy
@Traveling Tech Guy: True, I changed from lists to `set` and `frozenset`. I think, a set has access of O(1) as it is implemented as dictionary keys.
Felix Kling
+2  A: 

I don't think its complicated enough to qualify for a name.

You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.

Unique pairings:

SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id

Potentially there are a lot of ways in which the the comparison operation can be used to generate data summmaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.

C.

symcbean
+3  A: 

Assume all items have a number ItemNumber

Simple solution -- always have the 2nd element's ItemNumber greater than the first item.

eg

for (firstitem = 1 to maxitemnumber)
  for (seconditem = firstitemnumber+1 to maxitemnumber)
    compare(firstitem, seconditem)

visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.

........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
Hogan
I thought about that too, but maybe the OP does not want to change model classes.
Felix Kling
@Felix : Cool... tell me what you mean by that? Model class? What is that exactly?
Hogan
Well I assume that the OP somehow use a class to represent documents (the model) (maybe he uses an ORM). But I just realized that this is easy to do with list indexes. First I thought you mean to make the documents comparable. I still think you could provide a better example ;). But nevermind.
Felix Kling
@Felix : I get it, you were implementing a solution for sets as op. to arrays / collections with a unique item number. You solution is of course the way to do it with sets. Rare that you don't have an underlying unique index to an array, but it could happen -- mostly with functional languages and non-relational DBs.
Hogan
A: 

Something like this?

src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i:]:
        compare(x, y)

Or you might wish to generate a list of pairs instead:

pairs = [(x, y) for i, x in enumerate(src) for y in src[i:]]
medina