tags:

views:

168

answers:

5

I'm trying to write some code that will do "fuzzy hashing". That is: I want several inputs to hash to the same output so that I can do searches etc quickly and easily. If A hashes to 1 and C hashes to 1, it will be trivial for me to find out that A is equivalent to C.

Designing such a hash function seems hard, so I was wondering if anyone had experience with CMPH or GPERF and could walk me through creating a function that would result in this hash function.

Thanks in advance! Stefan

@Ben

In this case, matrixes of booleans, but I can easily pack them into 64 bit integers. Rotations, translations, etc in the input are irrelevant and need to be weeded out. Thus:

000
111
000

Is equivalent to

111
000
000

and

001
001
001

(simplification)

@Kinopiko

My best bet thus far would be to determine some sort of "canonical" representation and design code that terminates when the transformations reach such a representation (say...packing all the bits at the bottom). Yet this is slow and I'm looking for a better way. My data set is large.

@Jason

These two would not hash to the same value.

000
010
000

000
011
000
+1  A: 

You can check out MinHash, which is a probablistic method for cases where items are defined by set membership.

I know you want to design your own hash function, but maybe this is what you are looking for.

bayer
+1  A: 

Let's take SOUNDEX, as an illustration of what you may be looking for...

  • it does hash distinct but similar keys with the same value
  • it would allow asserting that entity A (say, "McDonald") is possibly the same as entity C ("MacDonnel")

One characteristic of SOUNDEX is that it works well in a specific domain (that of names, particular family names) and that it leverages rules associated with the pronunciation of words [in the English language, and by extension many languages of the same origin].

Your hashing (or is it almost a form of indexing?) algorithm will only be successful if a [relatively simple] set of rules exist (or can be discovered) to express the "sameness" of the items/records considered. For example if the underlying database pertains to automobiles, and if the criteria for "sameness" is that of size and general performance, maybe the hashing algorithm should be based on attributes from the record such as price (converted to a range), number of cylinders, number of doors, and maybe estimated gas mileage (too converted to a range).

In a nutshell, I hope the above illustrates the need to tailor the algorithm to the semantics we wish to associate with hash value identity (or proximity... hence looking more and more like an index...), as well as these semantics are represented within the available data from the items.

Items may be similar along many very dimensions. It is a matter of defining what these dimensions are are how the attributes "on" this dimension can be used as keys to a hashing function.

On CMPH and gperf...
These are implementations of perfect, optionally mimimal, hash functions. This kind of function would allow you to prevent collisions. Not what is needed here (I think)

mjv
A: 

How collision-free does the hash have to be (in terms of false-positives)?

What about:

  1. sorting the numbers, then running a standard hash on that sequence?
  2. Or, treat each rows/columns as a set and hash the contents of each using addition. Then sort those results and run a standard hash across the sorted sums?
  3. Or, the product of 1 and 2.

For example, here it is in Java (quick & dirty, but you get the idea):

import java.util.*;

/**
 *
 * @author Mark Bolusmjak
 */
public class MatrixTest {


  LinkedList<LinkedList<Integer>> randomMatrix(int size){
    LinkedList<LinkedList<Integer>> rows = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<size; i++){
      LinkedList<Integer> newRow = new LinkedList<Integer>();
      for (int j=0; j<size; j++){
        newRow.add((int)(5*Math.random()));
      }
      rows.add(newRow);
    }
    return rows;
  }

  LinkedList<LinkedList<Integer>> trans(LinkedList<LinkedList<Integer>> m){
    if (Math.random()<0.5){ //column translation
      for (LinkedList<Integer> integers : m) {
        integers.addFirst(integers.removeLast());
      }
    } else { //row translation
      m.addFirst(m.removeLast());
    }
    return m;
  }

  LinkedList<LinkedList<Integer>> flipDiagonal(LinkedList<LinkedList<Integer>> m){
    LinkedList<LinkedList<Integer>> flipped = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<m.size(); i++){
      flipped.add(new LinkedList<Integer>());
    }

    for (LinkedList<Integer> mRows : m) {
      Iterator<Integer> listIterator = mRows.iterator();
      for (LinkedList<Integer> flippedRows : flipped) {
        flippedRows.add(listIterator.next());
      }
    }
    return flipped;
  }


  public static void main(String[] args) {
    MatrixTest mt = new MatrixTest();
    LinkedList<LinkedList<Integer>> m = mt.randomMatrix(4);
    mt.display(m);

    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

  }


  private void display(LinkedList<LinkedList<Integer>> m){
    for (LinkedList<Integer> integers : m) {
      System.out.println(integers);
    }
    System.out.println("");
  }

  int hash1(LinkedList<LinkedList<Integer>> m){
    ArrayList<Integer> sorted = new ArrayList<Integer>();

    for (LinkedList<Integer> integers : m) {
      for (Integer integer : integers) {
        sorted.add(integer);
      }
    }
    Collections.sort(sorted);
    return sorted.hashCode();
  }

  int hash2(LinkedList<LinkedList<Integer>> m){
    List<Integer> rowColumnHashes = new ArrayList<Integer>();
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    m = flipDiagonal(m);
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    Collections.sort(rowColumnHashes);
    return rowColumnHashes.hashCode();
  }



} // end of class
z5h
A: 

In your example matrices, rotations and translations, etc seem to me to cover pretty much everything such that your algorithm would say that all matrices are equal. Can you give an example of sets of matrices that are different?

Also, it looks like you're doing something similar to image searching - algorithms for identifying images that are the same but possibly scaled, rotated (possibly arbitrarily), flipped, etc. I don't know much about any of this stuff, but maybe you could look to that research area for inspiration.

Also, I seem to remember something from Vector Calc about Eigenvalues which you might find useful.

uosɐſ
A: 

The chd algorithm implemented by cmph can do k-perfect hashing, which I think it is exactly what you are looking for:

http://cmph.sourceforge.net/chd.html

However, it would be better to know what you mean by "large input". If you are talking about hundreds of thousands entries, there are more straightforward solutions. If you have hundreds of millions then chd is probably your best bet.

Davi