ansaurus

Question

Answer 1

A:

You can keep track of which documents you have already compared, e.g. (with numbers ;))

compared = set()

for i in [1,2,3]:
    for j in [1,2,3]:
        pair =  frozenset((i,j))
        if i != k and pair not in compared:
            compare.add(pair)
            compare(i,j)

Another idea would be to create the combination of documents first and iterate over them. But in order to generate this, you have to iterate over both lists and the you iterate over the result list again so I don't think that it has any advantage.

Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:

docs = [1,2,3]
l = len(docs)
for i in range(l):
    for j in range(i+1,l):
        compare(l[i],l[j])

Felix Kling 2010-03-11 15:37:19

My solution is much faster. No overhead of the pair stuff.

Hogan 2010-03-11 15:44:55

Doesn't this implementation turn the complexity from an n^2 to an n^3? Not only do you have the nested loop, but the "pair not in compared" is an O(n) operation. There are easier/more efficient implementations. But it seems like @seanieb is satisfied.

Traveling Tech Guy 2010-03-11 15:47:30

@Traveling Tech Guy: True, I changed from lists to `set` and `frozenset`. I think, a set has access of O(1) as it is implemented as dictionary keys.

Felix Kling 2010-03-11 15:56:31

Answer 2

+2 A:

I don't think its complicated enough to qualify for a name.

You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.

Unique pairings:

SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id

Potentially there are a lot of ways in which the the comparison operation can be used to generate data summmaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.

C.

symcbean 2010-03-11 15:41:03

Answer 3

+3 A:

Assume all items have a number ItemNumber

Simple solution -- always have the 2nd element's ItemNumber greater than the first item.

eg

for (firstitem = 1 to maxitemnumber)
  for (seconditem = firstitemnumber+1 to maxitemnumber)
    compare(firstitem, seconditem)

visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.

........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.

Hogan 2010-03-11 15:44:27

I thought about that too, but maybe the OP does not want to change model classes.

Felix Kling 2010-03-11 15:47:14

@Felix : Cool... tell me what you mean by that? Model class? What is that exactly?

Hogan 2010-03-11 15:50:14

Well I assume that the OP somehow use a class to represent documents (the model) (maybe he uses an ORM). But I just realized that this is easy to do with list indexes. First I thought you mean to make the documents comparable. I still think you could provide a better example ;). But nevermind.

Felix Kling 2010-03-11 16:00:54

@Felix : I get it, you were implementing a solution for sets as op. to arrays / collections with a unique item number. You solution is of course the way to do it with sets. Rare that you don't have an underlying unique index to an array, but it could happen -- mostly with functional languages and non-relational DBs.

Hogan 2010-03-11 17:53:08

Answer 4

A:

Something like this?

src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i:]:
        compare(x, y)

Or you might wish to generate a list of pairs instead:

pairs = [(x, y) for i, x in enumerate(src) for y in src[i:]]

medina 2010-03-11 17:32:47

ansaurus

tags:

views:

answers:

Compare DB row values efficiently

related questions