How does Shingleprinting work in practice? | ansaurus

Q:

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:

1. Create a 5-shingling of the two documents D1 and D2.
2. Hash each shingle with a 64-bit hash.
3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes.
4. For each document, find the smallest of the resulting values.
5. If the two smallest values match, count it as a positive example; otherwise count it as a negative example.
6. Repeat 3. to 5. a few times.
7. Use positive_examples / total_examples as the similarity measure.
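
To make steps 1 and 2 concrete, here is a minimal Python sketch. It assumes word-level shingles and uses BLAKE2b truncated to 8 bytes as the 64-bit hash; both choices, and the helper names `shingles`, `hash64`, `shingle_hashes` and `jaccard`, are illustrative assumptions rather than part of the procedure above. The `jaccard` function computes the exact set resemblance, which is the quantity the sampling in steps 3 to 7 is estimating.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles of text (word-level tokens are an assumption)."""
    words = text.split()
    # A document with fewer than k words yields an empty set.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(shingle):
    """Map a shingle to a 64-bit integer via BLAKE2b with an 8-byte digest."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingle_hashes(text, k=5):
    """Steps 1 and 2 combined: shingle the text, then hash every shingle."""
    return {hash64(s) for s in shingles(text, k)}


def jaccard(a, b):
    """Exact resemblance |A & B| / |A | B| of two sets of shingle hashes."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingle_hashes(d1), shingle_hashes(d2)))
```

The two sample sentences have five shingles each and share four of them, so barring a hash collision the exact resemblance is 4/6.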

Step 3 involves generating a random permutation of a sequence with 2^64 elements, so a Knuth shuffle seems out of the question. Is there some shortcut for this? Note that in the end we only need a single element of the resulting permutation for each document: the smallest permuted value.
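
One possible shortcut, offered as a sketch under stated assumptions rather than a confirmed answer: a permutation never has to be materialized if it can be evaluated one point at a time. The map x -> (a*x + b) mod 2^64 with an odd multiplier a is a bijection on the 64-bit integers, so drawing random a and b gives a permutation that can be applied directly to each shingle hash. Such affine maps are only a small family of all possible permutations, so the resulting estimate is approximate. The names `random_affine_permutation` and `estimate_similarity` and the trial count are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1


def random_affine_permutation(rng):
    """Return x -> (a*x + b) mod 2^64, a bijection on 64-bit ints because a is odd."""
    a = rng.getrandbits(64) | 1  # force an odd multiplier, invertible modulo 2^64
    b = rng.getrandbits(64)
    return lambda x: (a * x + b) & MASK64


def estimate_similarity(hashes1, hashes2, trials=200, seed=0):
    """Steps 3 to 7: fraction of trials in which both documents share the same minimum.

    Both inputs must be non-empty collections of 64-bit shingle hashes.
    """
    rng = random.Random(seed)
    positives = 0
    for _ in range(trials):
        perm = random_affine_permutation(rng)  # step 3, evaluated pointwise
        # Steps 4 and 5: compare the smallest permuted value of each document.
        if min(map(perm, hashes1)) == min(map(perm, hashes2)):
            positives += 1
    return positives / trials  # step 7
```

Only the shingle hashes that actually occur are ever permuted, so each trial costs time proportional to the number of shingles rather than to 2^64.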
