views:

241

answers:

5

I have a set containing thousands of addresses. Assuming I can get the longitude and latitude of each address, how would I go about splitting the set into groups by proximity?

Further, I may want to retry the 'clustering' according to different rules:

  • N groups
  • M addresses per group
  • maximum distance between any address in a group

Advice?

+1  A: 

The "N groups" and "M addresses per group" constraints are mutually exclusive. One implies the other.

Christopher
couldn't you have N groups with different number of addresses in each group?
carrier
But it's not a constraint. It would be a result of the algorithm.
Mr Grieves
which is not a constraint? Anyway, if I say there must be M addresses per group, then probably I will end up with a known N number of groups. But if I specify that there must be N groups, M addresses per group may or may not be a consequence.
carrier
carrier- you make a valid point
Christopher
+8  A: 

You could try the k-means clustering algorithm.

Fabian Steeg
More generally, http://en.wikipedia.org/wiki/Data_clustering
Steve S
+3  A: 

You want vector quantization:

http://en.wikipedia.org/wiki/Vector_quantization

"It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms."

Here the vectors are the geographic coordinates of each address, and you can feed your algorithms with other parameters depending on your constraints (proximity, group size, number of groups...).

You can start with k-means, but from my experience a Voronoi based algorithm is more flexible. A good introduction here.

fbonnet
A: 
  1. Build a matrix of distances between all addresses.
  2. Starting with a random address, sort the matrix by ascending distance to that address
  3. Removing the addresses from the matrix as you go along, place the addresses closest to the start address into a new group until you reach your criteria (size of group or max distance).
  4. Once a group is full, choose another random address and resort the matrix by distance to that address
  5. Continue like this until all addresses are taken out of the matrix.

If addresses were distributed evenly, each group would have a sort of circular shape around the start address. The problem comes when start addresses are near existing groups. When this happens, the new group will sort of wrap around the old one and could even circle it completely if your stop criteria is only group size. If you use the max-distance constraint, then this is not going to happen (assuming no other constraints).

I don't really know if this is a good way of doing it but it's what I'd try. I'm sure lots of optimization would be required. Especially for addresses on the edges.

Mr Grieves
A: 

It depends a bit on the scale of the data you are wanting to cluster. The brute force approach is to calculate the distance between all combination of points into a distance array. The resulting array is N^2 and since the distance from A to B is the same as B to A you only need half those, so the resulting set is N^2/2.

For relatively close lat lon coordinates you can sometimes get away with using the lat long as an x,y grid and calculating the Cartesian distance. Since the real world is not flat the Cartesian distance will have error. For a more exact calculation which you should use if your addresses are located across the country, see this link from Mathforum.com.

If you don't have the scale to handle the entire distance matrix, you will need to do some algorithm programming to increase efficiency.

JD Long
Mr Grieves beat me to the same answer. That's what I get for going to work out after starting an answer then finishing the answer when i come back!
JD Long