views:

1700

answers:

17

I need help selecting or creating a clustering algorithm according to certain criteria.

Imagine you are managing newspaper delivery persons.

  • You have a set of street addresses, each of which is geocoded.
  • You want to cluster the addresses so that each cluster is assigned to a delivery person.
  • The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
  • Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
  • For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.

So... ideas?

UPDATE

The street network graph, as described in Arachnid's answer, is not available.

+1  A: 

You can have K means or expected maximization remain as unchanged as possible by using the previous cluster as a clustering feature. Getting each cluster to have the same amount of items seems bit trickier. I can think of how to do it as a post clustering step by doing k means and then shuffling some points until things balance but that doesn't seem very efficient.

hacken
+2  A: 

Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint to cover the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either euclidean or network space. If you're willing to dig up something in dead tree format, check out chapter 4 of Network and Discrete Location by Daskin.

Chris Upchurch
+1  A: 

Good survey of simple clustering algos. There is more though: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

alphadogg
+5  A: 

I think you want a hierarchical agglomeration technique rather than k-means. If you get your algorithm right you can stop it when you have the right number of clusters. As someone else mentioned you can seed subsequent clusterings with previous solutions which may give you a siginificant performance improvement.

You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.

I'm presuming that your real problem has nothing to do with delivering newspapers...

Simon
+2  A: 

Perhaps a minimum spanning tree of the customers, broken into set based on locality to the paper boy. Prims or Kruskal to get the MST with the distance between houses for the weight.

sfossen
+3  A: 

I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).

Nick Johnson
that's nice and all but as stated in the question, the addresses are geocoded, that's all the information that is available. there is no graph of the street network.
carrier
Ah, I missed that.
Nick Johnson
A: 

I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:

when paperboys are:

  • Added: They take locations from one or more paperboys working in the same general area from where the new guy lives.
  • Removed: His locations are given to the other paperboys, using the closest locations to their routes.

when locations are:

  • Added : Same thing, the location is added to the closest route.
  • Removed: just removed from that boy's route.

Once a quarter, you could re-calculate the whole thing and change all the routes.

Osama ALASSIRY
He never mentionned anything about paperboy home locations and this doesn't address any of the key challenges to his problem (generating the clusters).
Mr Grieves
+6  A: 

What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).

However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...

If you decide to stick with the explicit clustering step, don't forget to include your distribution in all clusters, as it is part of each route.

For evaluating the clusters using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP-variants). If a graph model is not available, you can use the taxicab-metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.

Christoph
any good resources you can point me to?
carrier
+1  A: 

A trivial answer which does not get any bonus points:

One delivery person for each address.

JXG
AKA go buy your own damn paper! :P
therefromhere
+3  A: 

Have you thought about using an economic/market based solution? Divide the set up by an arbitrary (but constant to avoid randomness effects) split into even subsets (as determined by the number of delivery persons).

Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.

Iterate allowing each person in turn to auction their worst point, and give each person a maximum budget.

This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two. It is also pretty flexible (for example, would allow one point miles away from any others to be given a premium fairly easily).

Nick Fortescue
+3  A: 

This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."

Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.

Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.

Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.

Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.

Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.

The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.

Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.

This will generate initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - I would store all of these in a database.

I'll leave the recalc procedure up to you - but having all the current routes, clusters, etc in a database will enable you to keep all historic routes, and also try various scenarios to see how to best to adapt to changes creating the least possible changes to existing routes.

Bork Blatt
+1  A: 

I acknowledge that this will not necessarily provide clusters of roughly equal size:

One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005) What you do is:

Given a data set with n patterns.

  1. Use an algorithm like k-means over a range of k. Or use a set of different algorithms, the goal is to produce an ensemble of partitions.

  2. Create a co-association matrix C of size n x n.

  3. For each partition p in the ensemble:
    3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.

  4. Use a clustering algorihm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendogram as result in which we choose the clustering with the longest lifetime.

I'll provide descriptions of SL and k-means if you're interested.

+1  A: 
  • You have a set of street addresses, each of which is geocoded.
    • You want to cluster the addresses so that each cluster is assigned to a delivery person.
    • The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
    • Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
    • For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.

As has been mentioned a Vehicle Routing Problem is probably better suited... Although strictly isn't designed with clustering in mind, it will optimize to assign based on the nearest addresses. Therefore you're clusters will actually be the recommended routes.

If you provide a maximum number of deliverers then and try to reach the optimal solution this should tell you the min that you require. This deals with point 2.

The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now its a capcitated vehicle routing problem).

Adding time windows or hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capcitated vehicle routing problem with time windows).

If you use a nearest neighbour algorithm then you can get identical results each time, removing a single address shouldn't have too much impact on your final result so should deal with the last point.

I'm actually working on a C# class library to achieve something like this, and think its probably the best route to go down, although not neccesairly easy to impelement.

Ian
+3  A: 

This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.

I've used three different optimization algorithms to good effect on problems like this:

  1. Simulated Annealing
  2. Great Deluge Algorithm
  3. Genetic Algoritms

Using an optimization algorithm, I think you've described the following "goals":

  1. The geographic area for each paper boy should be minimized.
  2. The number of subscribers served by each should be approximately equal.
  3. The distance travelled by each should be about equal.
  4. (And one you didn't state, but might matter) The route should end where it began.

Hope this gets you started!

* Edit *

If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.

If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate) you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.

The only time you'd have to rerun the algorithm was when and external factor (such as a recession/depression) caused changes in the behavior of a demographic group.

Steve Moyer
in my case, "The route should end where it began." does not apply. in fact, i don't care about assigning routes, just sets of addresses. they can take care of routing themselves.
carrier
Routeing have nothing to do with the actual way the take, just that route 1 is A ->B-C and route 2 is E->D>-G
Jonke
+1 for stating that the question is a OR (http://en.wikipedia.org/wiki/Operations_research)
Jonke
@carrier ... what if the cluster is bisected by a major interstate? Just dropping a cluster anywhere doesn't guarantee a routable solution ... see my edit based on the lack of those criteria
Steve Moyer
@steve moyer ... i don't have demographic information such as population density, subscription adoption/cancellation rates... what i have, is what i stated in the problem question
carrier
@carrier ... you have the current list ... which contains the starting subscription density for each area ... you can gather the other statistics over time.
Steve Moyer
+1  A: 

I know of a pretty novel approach to this problem that I have seen applied to Bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one you want to minimise the number of clusters, the trival solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.

At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering

Brehtt
+9  A: 
eljenso
thanks for your answer... very much appreciated
carrier
+1 for effort, and by effort I mean pretty pictures
Ryan Graham
I'm going to submit a meta request to let me upvote twice, just to reward your awesome work here.
rwmnau
+1  A: 

This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.

UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.

For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.

There's an article about it here: http://abcnews.go.com/wnt/story?id=3005890

sjbotha