I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:
- Create a 5-shingling of the two documents D1, D2
- Hash each shingle with a 64-bit hash
- Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
- For each document find the smallest of the resulting values
- If they match count it as a positive example, if not count it as a negative example
- Repeat 3. to 5. a few times
- Use
positive_examples / total examples
as the similarity measure
Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.