views:

17

answers:

0

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:

  1. Create a 5-shingling of the two documents D1, D2
  2. Hash each shingle with a 64-bit hash
  3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
  4. For each document find the smallest of the resulting values
  5. If they match count it as a positive example, if not count it as a negative example
  6. Repeat 3. to 5. a few times
  7. Use positive_examples / total examples as the similarity measure

Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.