I have a huge dataset of words word_i and weights weight[i,j], where weight[i,j] is the "connection strength" between words word_i and word_j.

I'd like to binarize this data, but I want to know whether there is an existing algorithm that assigns a binary code to each word in such a way that the Hamming distance between two words' codes correlates with their weight.
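For concreteness, here is a toy sketch of the goal (all codes and weights below are made up for illustration): high weight should correspond to small Hamming distance, so a good coding shows a strong negative rank correlation between the two.

    import numpy as np
    from scipy.stats import spearmanr

    def hamming(a: int, b: int) -> int:
        """Number of differing bits between two integer-encoded codes."""
        return bin(a ^ b).count("1")

    # Hypothetical 8-bit codes for 4 words and their pairwise weights.
    codes = [0b00000000, 0b00000011, 0b00001111, 0b11111111]
    weights = np.array([[0, 5, 3, 1],
                        [5, 0, 4, 2],
                        [3, 4, 0, 1],
                        [1, 2, 1, 0]])

    pairs = [(i, j) for i in range(len(codes)) for j in range(i + 1, len(codes))]
    dists = [hamming(codes[i], codes[j]) for i, j in pairs]
    ws = [weights[i, j] for i, j in pairs]

    # High weight should mean small Hamming distance,
    # so we expect a clearly negative correlation.
    rho, _ = spearmanr(dists, ws)
    print(f"Spearman correlation (distance vs. weight): {rho:.2f}")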

Added:

The problem I'm working on: I want to teach a neural net or an SVM to make associations between words, and that's why I've decided to binarize the data. Don't ask why I don't want to use Markov models or just graphs; I've tried them and want to compare them with neural nets.

So,

  1. I want my NN, given a word "a", to return its closest association, or a set of words with their probabilities,

  2. I've tried simply binarizing and feeding the concatenation "ab" as input with the weight as the target output (sketched after this list), but this worked badly,

  3. I was thinking of choosing a weight threshold at which one more bit changes; the smaller this threshold, the more bits you need,

  4. I have the situation a->b with weight w1 and b->a with weight w2, where w1 >> w2, so direction is significant.
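A minimal sketch of what attempt 2 above might look like (the post gives no code, so the code length and encoding details here are assumptions):

    import numpy as np

    N_BITS = 8  # hypothetical code length

    def to_bits(code: int, n_bits: int = N_BITS) -> np.ndarray:
        """Unpack an integer code into a 0/1 feature vector (LSB first)."""
        return np.array([(code >> k) & 1 for k in range(n_bits)], dtype=float)

    def make_example(code_a: int, code_b: int, weight: float):
        """Input is the concatenation "ab", target is the weight.
        This representation is direction-aware: (a, b) != (b, a),
        which matters since w(a->b) can differ from w(b->a)."""
        x = np.concatenate([to_bits(code_a), to_bits(code_b)])
        return x, weight

    x, y = make_example(0b00001111, 0b11110000, 0.7)
    print(x.shape, y)  # (16,) 0.7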

A:

Hi,

what you can do is use a self-organizing map (SOM) whose topology is the set of fixed-length, say N-bit, binary words, so that e.g. for N=8 every cell in the SOM has exactly 8 neighbors (the words obtained by flipping one bit). Now, if you have K [dictionary] words, you can encode every [dictionary] word as a vector of K real numbers between 0 and 1, where the ith word has its ith element set to 1 and all others to 0. You can then calculate the "distance" between two arbitrary vectors a1...aK and b1...bK by summing over

  i, j :  a_i * b_j * d(word_i, word_j)

where d(word_i, word_j) is a per-pair distance between words (e.g. derived from weight[i,j], with high weight meaning small distance). This gives you the distance metric for running the SOM algorithm. When the SOM has stabilized, [dictionary] words near each other in your metric are near each other in the topology of the map, from which you read off the encoding trivially as [binary] words.
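For concreteness, a minimal sketch of that metric as I read it (my code, not the answerer's; word_dist is a hypothetical precomputed per-pair distance table derived from the weights):

    import numpy as np

    def vector_distance(a: np.ndarray, b: np.ndarray, word_dist: np.ndarray) -> float:
        """distance(a, b) = sum over i, j of a[i] * b[j] * word_dist[i, j].

        For one-hot a and b this is just word_dist between the two words;
        for SOM prototype vectors it blends word distances by activation."""
        return float(a @ word_dist @ b)

    K = 4  # number of dictionary words
    word_dist = np.array([[0.0, 0.2, 0.7, 0.9],
                          [0.2, 0.0, 0.5, 0.8],
                          [0.7, 0.5, 0.0, 0.3],
                          [0.9, 0.8, 0.3, 0.0]])

    a = np.array([1.0, 0.0, 0.0, 0.0])  # one-hot: word 0
    b = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot: word 2
    print(vector_distance(a, b, word_dist))  # 0.7, i.e. word_dist[0, 2]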

Note that the map must have more cells than there are words, i.e. 2**N > K.

This answer of course assumes some background with self-organizing maps; see http://en.wikipedia.org/wiki/Self-organizing_map

antti.huima
>> Now if you have K [dictionary] words you can encode every [dictionary] word as a vector of real numbers between 0..1 so that the ith word has the ith element set to 1 and others to 0.

Does this mean I'll have number of bits == number of words? Wouldn't it be better to use ordinary binary coding, with number of bits = log2(N), where N is the number of dictionary words? Or do you mean to encode it as an N*N matrix, where w_ij is weight(a_i, b_j)? In that case, with every row as a training example for the SOM, the number of examples == the number of variables. Thank you!
Ivri
I've solved the problem using MDS (multi-dimensional scaling).
Ivri
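A hedged sketch of what that MDS route could look like (my reconstruction, not Ivri's code; the sizes and the weight-to-dissimilarity conversion are assumptions): embed the words with sklearn's MDS from a precomputed dissimilarity matrix, then threshold each embedding coordinate at its median to get binary codes.

    import numpy as np
    from sklearn.manifold import MDS

    K, N_BITS = 4, 3  # hypothetical sizes
    # Assumed conversion: high weight -> small dissimilarity. MDS needs a
    # symmetric matrix, so the a->b vs b->a asymmetry mentioned in the
    # question has to be symmetrized away here.
    weights = np.array([[0, 5, 3, 1],
                        [5, 0, 4, 2],
                        [3, 4, 0, 1],
                        [1, 2, 1, 0]], dtype=float)
    dissim = weights.max() - weights
    np.fill_diagonal(dissim, 0.0)

    mds = MDS(n_components=N_BITS, dissimilarity="precomputed", random_state=0)
    emb = mds.fit_transform(dissim)  # shape (K, N_BITS)

    # Threshold each coordinate at its median so bits come out roughly balanced.
    codes = (emb > np.median(emb, axis=0)).astype(int)
    print(codes)  # one N_BITS-bit code per word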