views:

190

answers:

6

Now, there are lots of ordered numeric vectors(about 50 dimension). Each dimension is a integer between 0~200. I want to sort them to ensure the similar vectors in the same bucket, all vectors in adjacent buckets have also a similarity to a certain extent. e.g. <1,24,25,78,9> and <2,24,24,78,9> should be in same bucket(bucket number is 010), but <3,29,26,74,11> and <4,28,29,75,10> (they are also in same bucket)is in adjacent buckets (bucket number is 011)

how to design such a sorting function?

A: 

What you describe is not much like a "good" hashing function in the normal sense.

Can you be more specific about how you get from those particular sequences to those bucket numbers (010, 011)? Your hash function is basically going to be that same process.

Tim Sylvester
+1  A: 

It sounds like you're trying to sort your vectors by locality.

Perhaps what you really want is a generalization of a QuadTree. If each vector of length R is considered a coordinate in R-Space (R==2 -> 2d plane space, r==3 -> 3d space, etc), then you can divide that R-cube in half along each dimension, to get 2R inner R-hyper cubes. continue this process any time a cube contains more than 1 un-equal coordinates. You can then traverse this tree of R-hypercubes to efficiently locate neighboring vectors.

TokenMacGuy
But for construcing the R-hypercubes, it is necessary to calculate the similarity/distance between any two vectors? It's time consuming for a large number of vectors.
you dont calculate distances, you partition the space into at least n hypercubes for each n vectors. you do the comparision on each dimension seperately.
TokenMacGuy
+2  A: 

You want a Morton code. It interleaves the bits of each dimension to help keep similar values together. It's often done for 2 and 3D, but it works for any dimension.

Express each of the D values as binary word of B bits, then interleave the bits to form a new D*B bit long number. That's your lookup table number. If you want a smaller number, discard lower bits to get fewer bins.

An even better (but much more annoying to compute) function is a multidimensional Hilbert curve mapping. This is very difficult to work with in practice, but it does have pretty much the best indexing locality you can get.

SPWorley
i've been using this spacefilling curve without knowing what it was called (or that it had already been invented). thanks.
TokenMacGuy
A: 

What you want is pretty much the exact opposite of a hash: for hashing, you want to avoid having similar values end up in the same bucket.

It seems like you're looking for a spatial index; R-Trees sounds the most promising.

Michael Borgwardt
A: 

Sounds like point clustering. And it's a very active field. Try googling "vector qantization", k-means clustering, dendrograms, ...

Ray
I know it's like clustering, but clustering is time consuming. And the number of vectors in system is dynamic. perhaps there are more than 1000 new vectors inserted into the system. So the clustering has to redo.
A: 

Perhaps this example is not good. The similarity I wish is that the cosine of angles between two vectors is small. In fact, i want to transform a text to a vector where each dimension corresponds to a concept in the text. I suppose all text have same number of concepts so that their vector dimension are same. – cooky

I assume you mean the cosine is close to 1, so they're both going in close to the same direction. A cosine close to zero means the angle is about orthogonal.

And by the way, if you're doing that, then that means that, say (1,2,3,4,5) and (10,20,30,40,49) should be very close together (despite the actual numbers being far apart). So you want to normalize your vectors prior to using any of the spatial indexing methods.

Alex319