ansaurus

Question

Intentional Hashing Collisions

Answer 1

+1 A:

You can check out MinHash, which is a probablistic method for cases where items are defined by set membership.

I know you want to design your own hash function, but maybe this is what you are looking for.

bayer 2009-11-05 16:21:32

Answer 2

+1 A:

Let's take SOUNDEX, as an illustration of what you may be looking for...

it does hash distinct but similar keys with the same value
it would allow asserting that entity A (say, "McDonald") is possibly the same as entity C ("MacDonnel")

One characteristic of SOUNDEX is that it works well in a specific domain (that of names, particular family names) and that it leverages rules associated with the pronunciation of words [in the English language, and by extension many languages of the same origin].

Your hashing (or is it almost a form of indexing?) algorithm will only be successful if a [relatively simple] set of rules exist (or can be discovered) to express the "sameness" of the items/records considered. For example if the underlying database pertains to automobiles, and if the criteria for "sameness" is that of size and general performance, maybe the hashing algorithm should be based on attributes from the record such as price (converted to a range), number of cylinders, number of doors, and maybe estimated gas mileage (too converted to a range).

In a nutshell, I hope the above illustrates the need to tailor the algorithm to the semantics we wish to associate with hash value identity (or proximity... hence looking more and more like an index...), as well as these semantics are represented within the available data from the items.

Items may be similar along many very dimensions. It is a matter of defining what these dimensions are are how the attributes "on" this dimension can be used as keys to a hashing function.

On CMPH and gperf...
These are implementations of perfect, optionally mimimal, hash functions. This kind of function would allow you to prevent collisions. Not what is needed here (I think)

mjv 2009-11-05 16:50:10

Answer 3

A:

How collision-free does the hash have to be (in terms of false-positives)?

What about:

sorting the numbers, then running a standard hash on that sequence?
Or, treat each rows/columns as a set and hash the contents of each using addition. Then sort those results and run a standard hash across the sorted sums?
Or, the product of 1 and 2.

For example, here it is in Java (quick & dirty, but you get the idea):

import java.util.*;

/**
 *
 * @author Mark Bolusmjak
 */
public class MatrixTest {


  LinkedList<LinkedList<Integer>> randomMatrix(int size){
    LinkedList<LinkedList<Integer>> rows = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<size; i++){
      LinkedList<Integer> newRow = new LinkedList<Integer>();
      for (int j=0; j<size; j++){
        newRow.add((int)(5*Math.random()));
      }
      rows.add(newRow);
    }
    return rows;
  }

  LinkedList<LinkedList<Integer>> trans(LinkedList<LinkedList<Integer>> m){
    if (Math.random()<0.5){ //column translation
      for (LinkedList<Integer> integers : m) {
        integers.addFirst(integers.removeLast());
      }
    } else { //row translation
      m.addFirst(m.removeLast());
    }
    return m;
  }

  LinkedList<LinkedList<Integer>> flipDiagonal(LinkedList<LinkedList<Integer>> m){
    LinkedList<LinkedList<Integer>> flipped = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<m.size(); i++){
      flipped.add(new LinkedList<Integer>());
    }

    for (LinkedList<Integer> mRows : m) {
      Iterator<Integer> listIterator = mRows.iterator();
      for (LinkedList<Integer> flippedRows : flipped) {
        flippedRows.add(listIterator.next());
      }
    }
    return flipped;
  }


  public static void main(String[] args) {
    MatrixTest mt = new MatrixTest();
    LinkedList<LinkedList<Integer>> m = mt.randomMatrix(4);
    mt.display(m);

    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

  }


  private void display(LinkedList<LinkedList<Integer>> m){
    for (LinkedList<Integer> integers : m) {
      System.out.println(integers);
    }
    System.out.println("");
  }

  int hash1(LinkedList<LinkedList<Integer>> m){
    ArrayList<Integer> sorted = new ArrayList<Integer>();

    for (LinkedList<Integer> integers : m) {
      for (Integer integer : integers) {
        sorted.add(integer);
      }
    }
    Collections.sort(sorted);
    return sorted.hashCode();
  }

  int hash2(LinkedList<LinkedList<Integer>> m){
    List<Integer> rowColumnHashes = new ArrayList<Integer>();
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    m = flipDiagonal(m);
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    Collections.sort(rowColumnHashes);
    return rowColumnHashes.hashCode();
  }



} // end of class

z5h 2009-11-05 16:55:33

Answer 4

A:

In your example matrices, rotations and translations, etc seem to me to cover pretty much everything such that your algorithm would say that all matrices are equal. Can you give an example of sets of matrices that are different?

Also, it looks like you're doing something similar to image searching - algorithms for identifying images that are the same but possibly scaled, rotated (possibly arbitrarily), flipped, etc. I don't know much about any of this stuff, but maybe you could look to that research area for inspiration.

Also, I seem to remember something from Vector Calc about Eigenvalues which you might find useful.

uosɐſ 2009-11-05 17:18:56

Answer 5

A:

The chd algorithm implemented by cmph can do k-perfect hashing, which I think it is exactly what you are looking for:

http://cmph.sourceforge.net/chd.html

However, it would be better to know what you mean by "large input". If you are talking about hundreds of thousands entries, there are more straightforward solutions. If you have hundreds of millions then chd is probably your best bet.

Davi 2009-12-18 12:50:48

ansaurus

tags:

views:

answers:

Intentional Hashing Collisions

related questions