I am using the MinHash algorithm to find similar images. I came across the post "How can I recognize slightly modified images?", which pointed me to MinHash.

I am using the C# implementation from the blog post "Set Similarity and Min Hash".

While trying to use that implementation, I ran into two problems.

  • What value should I pass as the universe size?
  • When I pass the image's byte array to a HashSet<byte>, the set keeps only the distinct byte values, so I end up comparing sets of at most 256 values (0 to 255).
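To illustrate the second problem, here is a minimal sketch that does not touch the MinHash class at all. It uses random bytes in place of a real image file (`MakeFakeImage` is a hypothetical helper made up for this example, not part of the blog's code) and shows that the set can never grow past 256 elements, however large the input is.

```csharp
using System;
using System.Collections.Generic;

static class DistinctBytesDemo
{
    // Hypothetical stand-in for reading an image file: returns pseudo-random bytes.
    public static byte[] MakeFakeImage(int length, int seed)
    {
        var data = new byte[length];
        new Random(seed).NextBytes(data);
        return data;
    }

    static void Main()
    {
        byte[] raw = MakeFakeImage(100000, 1);

        // HashSet<byte> keeps one entry per distinct value, so order and
        // repetition (i.e. the actual picture) are thrown away.
        var set = new HashSet<byte>(raw);

        Console.WriteLine("raw length = {0}, distinct values = {1}", raw.Length, set.Count);
    }
}
```

Whatever the input, `set.Count` is bounded by 256, because a `byte` has only 256 possible values.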

What is this universe in MinHash?
And what can I do to improve the C# MinHash implementation?

Since a HashSet<byte> holds at most 256 distinct values, the similarity value always comes out to 1.
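As far as I understand, MinHash estimates the Jaccard similarity |A ∩ B| / |A ∪ B| of the two sets. The sketch below computes that value exactly for two byte sets, without the blog's MinHash class; the `Jaccard` helper and the two hand-built sets are assumptions made for illustration. Once both inputs contain every byte value, the sets are identical and the result is 1, whatever the original data looked like.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ByteSetJaccardDemo
{
    // Exact Jaccard similarity of two byte sets: |A ∩ B| / |A ∪ B|.
    public static double Jaccard(HashSet<byte> a, HashSet<byte> b)
    {
        var intersection = new HashSet<byte>(a);
        intersection.IntersectWith(b);

        int unionCount = a.Count + b.Count - intersection.Count;
        return unionCount == 0 ? 1.0 : (double)intersection.Count / unionCount;
    }

    static void Main()
    {
        // Two different byte sequences that both use every value 0..255.
        byte[] ascending = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
        byte[] descending = ascending.Reverse().ToArray();

        var first = new HashSet<byte>(ascending);
        var second = new HashSet<byte>(descending);

        // The sequences differ, but the sets are equal, so this prints 1.
        Console.WriteLine(Jaccard(first, second));
    }
}
```

So the value of 1 seems to come from how I build the sets rather than from the MinHash step itself.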

Here is the source that uses the C# MinHash implementation from Set Similarity and Min Hash:

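using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// MinHash is the class from the "Set Similarity and Min Hash" blog post (not shown here).
class Program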
{
    static void Main(string[] args)
    {
        var imageSet1 = GetImageByte(@".\Images\01.JPG");
        var imageSet2 = GetImageByte(@".\Images\02.TIF");
        //var app = new MinHash(256);
        var app = new MinHash(Math.Min(imageSet1.Count, imageSet2.Count));
        double imageSimilarity = app.Similarity(imageSet1, imageSet2);
        Console.WriteLine("similarity = {0}", imageSimilarity);
    }

    private static HashSet<byte> GetImageByte(string imagePath)
    {
        using (var fs = new FileStream(imagePath, FileMode.Open, FileAccess.Read))
        using (var br = new BinaryReader(fs))
        {
            //List<int> bytes = br.ReadBytes((int)fs.Length).Cast<int>().ToList();
            // ReadBytes already returns a byte[]; the HashSet keeps only its distinct values.
            byte[] bytes = br.ReadBytes((int)fs.Length);
            return new HashSet<byte>(bytes);
        }
    }
}