I have a huge dataset of words word_i and weights weight[i,j], where weight[i,j] is the "connection strength" between word_i and word_j.
I'd like to binarize this data. Is there an existing algorithm that assigns each word a binary code such that the Hamming distance between two words' codes correlates with their weight?
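To make concrete what I mean by "correlates", here is a toy sketch (the words, weights, and codes are all made up, just to show the property I want: strongly connected words should get codes that are close in Hamming distance):

```python
def hamming(a: int, b: int, bits: int = 8) -> int:
    """Number of differing bits between two integer codes."""
    return bin((a ^ b) & ((1 << bits) - 1)).count("1")

# Toy pairwise weights (higher = stronger connection).
weights = {("cat", "dog"): 0.9, ("cat", "car"): 0.1, ("dog", "car"): 0.2}

# Candidate codes: strongly linked words get nearby codes.
codes = {"cat": 0b00000000, "dog": 0b00000001, "car": 0b11110001}

for (w1, w2), w in weights.items():
    d = hamming(codes[w1], codes[w2])
    print(f"{w1}-{w2}: weight={w}, hamming={d}")
```

Here the ordering of the Hamming distances (1, 5, 4) is the reverse of the ordering of the weights (0.9, 0.1, 0.2), which is the kind of relationship I'm after. The question is how to find such codes automatically for a huge dataset.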
Added:
The problem I'm working on: I want to teach a neural net or SVM to make associations between words, which is why I decided to binarize the data. Please don't ask why I don't want to use Markov models or plain graphs; I've already tried them and want to compare them with neural nets.
So, to summarize:

- Given a word "a", I want the NN to return its closest association, or a set of words with their probabilities.
- I tried simply binarizing the words and using the pair "ab" as input with the weight as the target output; this worked badly.
- I was thinking of using a weight threshold to decide when to flip one more bit: the smaller the threshold, the more bits you need.
- Direction is significant: I have cases where a->b has weight w1, b->a has weight w2, and w1 >> w2.
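The threshold idea above could be sketched like this (the function names and the linear weight-to-bits mapping are my own assumptions, not an existing algorithm):

```python
import math

def target_hamming(weight: float, threshold: float, max_weight: float = 1.0) -> int:
    """Each `threshold` of missing weight adds one more differing bit."""
    return math.ceil((max_weight - weight) / threshold)

def bits_needed(threshold: float, max_weight: float = 1.0) -> int:
    """The code length must cover the largest possible distance (weight 0)."""
    return target_hamming(0.0, threshold, max_weight)

print(target_hamming(0.9, 0.1))  # strong link -> 1 differing bit
print(bits_needed(0.1))          # threshold 0.1 -> 10-bit codes
```

This shows why the code length blows up as the threshold shrinks. It also doesn't address the direction problem: Hamming distance is symmetric, so a single pair of codes can't encode both w1 and w2 when w1 >> w2.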