ansaurus

Question

How to find an distinct URL only in set A not in set B.

Answer 1

+3 A:

A decent algorithm might be:

load all of set A into a hashmap, O(a)

traverse set B, and for each item, delete the identical value from set A (from the hashmap) if it exists, O(b)

Then your hashmap has the result. This would be O(a+b) where a is size of set A and b is size of set B. (In practice, this would be multiplied by the hash time, which ideally corresponds to approximately O(1) for a good hash.)

JoshD 2010-09-29 01:41:25

Answer 2

+2 A:

Something perhaps a little naive might be a procedure like

Sort list A
Sort list B
Navigate list A and B together such that:

a. Increment pointer to A and pointer to B when elements match

b. Increment pointer to B until the element matches the next element in a or until the record b in B would appear after the next element in a (this rule discards elements in B that are not in A)

c. A match has been found when incrementing subject to these rules such that the next element b in B does not match the next element a in A.

This might actually be an interesting place to apply Bloom filters: construct a Bloom filter for set B then for every URL in set A determine if it is in set B. With diminishingly small probability of error you should be able to find all URLs in A not in B.

Mark E 2010-09-29 01:44:39

Answer 3

+1 A:

(sort -u A; cat B B) | sort | uniq -u

piccolbo 2010-10-05 08:07:47

ansaurus

tags:

views:

answers:

How to find an distinct URL only in set A not in set B.

related questions