My current code is pretty quick, but I need to make it even faster so we can accommodate even more markers. Any suggestions?

Notes:

  • The code runs fastest when the SQL statement is ordered by marker name - which itself does a partial job of clustering the markers (the names of markers in the same location are often, but not always, similar).
  • I can't pre-cluster the markers, because they can be dynamically searched and filtered.
  • I've tried grid-based clustering - but the results often aren't very nice.
  • I know that the clusters are slightly skewed on a Mercator projection.
  • I'm not interested in a commercial clustering service.

The code:

$singleMarkers = array();
$clusterMarkers = array();

while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare marker against all remaining markers.
    foreach ($markers as $key => $compareMarker) {
        // This function returns the distance between two markers, at a defined zoom level.
        $pixels = pixelDistance($marker['lat'], $marker['lng'], $compareMarker['lat'], $compareMarker['lng'], $zoomLevel);
        // If two markers are closer than defined distance, remove compareMarker from array and add to cluster.
        if ($pixels < $distance) {
            unset($markers[$key]);
            $cluster[] = $compareMarker;
        }
    }

    // If a marker was added to cluster, also add the marker we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return sqrt(pow(($x1-$x2),2) + pow(($y1-$y2),2)) >> (21 - $zoom); //21 is the max zoom level
}

UPDATE

Here's the current code:

$singleMarkers = array();
$clusterMarkers = array();

// Maximum distance between two markers for them to be clustered together, at different zoom levels
$DISTANCE = (10000000 >> $ZOOM) / 100000;

// Loop until all markers have been compared.
while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare against all markers which are left.
    foreach ($markers as $key => $target) {
        $pixels = abs($marker['lat']-$target['lat']) + abs($marker['lng']-$target['lng']);

        // If the two markers are closer than given distance remove target marker from array and add it to cluster.
        if ($pixels < $DISTANCE) {
            unset($markers[$key]);
            $cluster[] = $target;
        }
    }

    // If a marker has been added to cluster, add also the one we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}
+1  A: 

It seems like speeding up the pixelDistance() function could be part of your solution, since it runs inside the loop. This is a good place to look first, but you didn't include that code, so I can't be sure.

John Fisher
Good point - I added the function. In pixelDistance() I previously converted the lat/lon values into pixel values, but I discovered I could speed it up if I just clustered using the straight-up lat/lon values, even though it skews the clusters a little bit.
Chris B
+2  A: 

Expanding on what John said, I think you should try inlining that function. Function calls in PHP are very slow, so you should get a decent speedup from that.
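For illustration, a minimal sketch of the inner loop with the distance check inlined - variable names are taken from the question, and hoisting $x1/$y1 out of the loop is my own addition:

$x1 = $marker['lng'] * 10000000;
$y1 = $marker['lat'] * 10000000;
foreach ($markers as $key => $compareMarker) {
    $x2 = $compareMarker['lng'] * 10000000;
    $y2 = $compareMarker['lat'] * 10000000;
    // Same math as pixelDistance(), but no function call per marker pair.
    $pixels = sqrt(pow($x1 - $x2, 2) + pow($y1 - $y2, 2)) >> (21 - $zoomLevel);
    if ($pixels < $distance) {
        unset($markers[$key]);
        $cluster[] = $compareMarker;
    }
}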

ryeguy
Awesome! This was actually fairly significant.
Chris B
A: 

I can see two more possible improvements here:

  • Can you just loop through $markers with a for loop instead of popping them off the array? The array popping is completely unneeded - you should really only use an array as a queue if you're adding and removing items at the same time (which you aren't; you're just processing the items and then throwing them away).

  • You should try calculating the count() of the arrays at the beginning, and then manually increasing or decreasing a $count variable - recalculating the size of an array on every iteration is wasteful (see the sketch below).
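A minimal sketch of that count-caching idea applied to the question's loop (variable names are from the question; where to place the decrements is my assumption):

$count = count($markers); // computed once, then maintained by hand
while ($count > 0) {
    $marker  = array_pop($markers);
    $count--;
    $cluster = array();

    foreach ($markers as $key => $compareMarker) {
        $pixels = pixelDistance($marker['lat'], $marker['lng'], $compareMarker['lat'], $compareMarker['lng'], $zoomLevel);
        if ($pixels < $distance) {
            unset($markers[$key]);
            $count--; // keep the cached count in sync instead of recounting
            $cluster[] = $compareMarker;
        }
    }

    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}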

ryeguy
Recalculating the size of the array each loop is indeed wasteful, but I don't see how I can get rid of it, with the current structure. The array may change significantly on a given loop - as I check the popped marker against each of the remaining markers, any of them may be added to a cluster and thus be removed from the array. At high zoom levels, all or most of the markers in the array will be removed and placed into a cluster, leaving no markers left to check. This significantly reduces the number of loops necessary to complete. I hope that made sense. :)
Chris B
+1  A: 

Do you actually need to calculate the Euclidean distance? If you are just comparing relative magnitudes of distances, you can probably get away with using the Manhattan distance, which will save you two calls to pow() and one to sqrt():

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return ($x1-$x2) + ($y1-$y2) >> (21 - $zoom);
}

Not sure if you need the >> (21 - $zoom) bit for your calculations, so I left it in. But unless you actually need the calculated distance values elsewhere, you can probably get away with using the latitude/longitude values directly (no need to multiply by anything) and taking the Manhattan distance, as long as you pre-calculate $distance to match that measure. That will be a lot cheaper computationally than coercing every distance to fit the units and magnitude of $distance.

EDIT: When I was researching this problem last year, I found some useful stuff on Wikipedia - yes, it can happen ;-)

I can also highly recommend the book Programming Collective Intelligence: Building Smart Web 2.0 Applications which goes into clustering in great depth, as applied not only to geographical data but also to other areas of data analysis.

NickFitz
This worked great. The >> (21 - $zoom) is how I worked the zoom level into the equation - but I decided it would be faster to get rid of it in the algorithm and calculate the qualifying $distance based on the zoom - so I only have to do it once. Thanks!
Chris B
Also, the equation does need to use absolute value: $pixels = abs($x1-$x2) + abs($y1-$y2);
Chris B
A: 

Here are some ideas you can try if performance is a major concern:

  • Reduce the dimensionality of the data: you have 2-D lng/lat data, so you could try projecting it down to 1-D using something like Multidimensional Scaling (MDS), which reduces the number of dimensions while preserving the distances from the original space; the distance function then only has to deal with one feature instead of two. An alternative is Principal Component Analysis (PCA).
  • Faster search: the step of computing the distance to each marker can be improved using KD-trees (a rough sketch follows below).

Both of these techniques are applied in an offline setting, so they are usually computed once and then used many times.
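As a rough illustration of the second idea, here is a hypothetical PHP sketch of a 2-D KD-tree with a radius query - all names are mine, and the Manhattan distance matches the question's updated code:

// Build a KD-tree over $points (each an array with 'lat' and 'lng' keys).
// Sorting at every level is O(n log^2 n) - fine for a one-off build.
function kdBuild(array $points, $depth = 0) {
    if (count($points) == 0) {
        return null;
    }
    $axis = ($depth % 2 == 0) ? 'lng' : 'lat';
    usort($points, function ($a, $b) use ($axis) {
        if ($a[$axis] == $b[$axis]) return 0;
        return ($a[$axis] < $b[$axis]) ? -1 : 1;
    });
    $mid = (int) floor(count($points) / 2);
    return array(
        'point' => $points[$mid],
        'axis'  => $axis,
        'left'  => kdBuild(array_slice($points, 0, $mid), $depth + 1),
        'right' => kdBuild(array_slice($points, $mid + 1), $depth + 1),
    );
}

// Collect every point whose Manhattan distance to ($lat, $lng) is below $radius.
function kdRange($node, $lat, $lng, $radius, array &$found) {
    if ($node === null) {
        return;
    }
    $p = $node['point'];
    if (abs($p['lat'] - $lat) + abs($p['lng'] - $lng) < $radius) {
        $found[] = $p;
    }
    $delta = ($node['axis'] == 'lat' ? $lat : $lng) - $p[$node['axis']];
    kdRange($delta < 0 ? $node['left'] : $node['right'], $lat, $lng, $radius, $found);
    // Only descend into the far side if the splitting plane is within range.
    if (abs($delta) < $radius) {
        kdRange($delta < 0 ? $node['right'] : $node['left'], $lat, $lng, $radius, $found);
    }
}

Building the tree is the one-off "offline" part; each marker's neighbourhood query then touches only a small part of the tree instead of every remaining marker.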

Amro
Do you know of any good examples of MDS or KD-trees?
Chris B
+1  A: 

Several constant-factor optimizations are available:

  • Take advantage of the fact that sqrt(x) < sqrt(y) is true iff x < y: omit the sqrt in pixelDistance and compare against $distance squared, calculated outside the loop.
  • Calculate the 21 - $zoomLevel outside the loop as well; you'll have to multiply it by 2 if you're comparing the squared values.
  • Save two multiplies by computing $x1-$x2 before scaling by 10000000.
  • Store the delta in a variable and multiply it by itself - probably faster than the pow function.
  • Inline the pixelDistance function.

This kind of optimization will only yield a constant-factor speedup (a combined sketch follows below).
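A sketch combining those ideas into the question's inner loop - variable names follow the question, and folding the zoom shift into the threshold (rather than shifting each distance) is my own rearrangement, equivalent up to integer truncation:

// Hoisted out of both loops: pre-scale the threshold and square it,
// so the per-pair test needs no sqrt(), pow() or shift at all.
$threshold   = $distance * (1 << (21 - $zoomLevel));
$thresholdSq = $threshold * $threshold;

foreach ($markers as $key => $compareMarker) {
    // Subtract first, then scale once: saves two multiplications per pair.
    $dx = ($marker['lng'] - $compareMarker['lng']) * 10000000;
    $dy = ($marker['lat'] - $compareMarker['lat']) * 10000000;

    // sqrt(x) < sqrt(y) iff x < y, so compare the squared values directly.
    if ($dx * $dx + $dy * $dy < $thresholdSq) {
        unset($markers[$key]);
        $cluster[] = $compareMarker;
    }
}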

For a larger speedup you'll need some kind of acceleration data structure. An easy one is to bin the markers into distance-sized squares. Then you can run over the bins and look for markers to cluster with only in the same bin and 3 others, chosen depending on which side of the bin's center the marker falls. This results in linear-time clustering that will beat any optimization of the quadratic algorithm for larger result sets.
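A rough sketch of the binning lookup in PHP (names are illustrative; for simplicity it scans all 8 neighbouring bins instead of the 3-bin variant described above, which doesn't change the linear running time):

// Bin size equals the cluster distance, so two markers closer than
// $DISTANCE always land in the same bin or in adjacent bins.
$bins = array();
foreach ($markers as $marker) {
    $bx = (int) floor($marker['lng'] / $DISTANCE);
    $by = (int) floor($marker['lat'] / $DISTANCE);
    $bins[$bx][$by][] = $marker;
}

// Candidate partners for a marker: its own bin plus the 8 neighbours.
function binCandidates($marker, $bins, $DISTANCE) {
    $bx = (int) floor($marker['lng'] / $DISTANCE);
    $by = (int) floor($marker['lat'] / $DISTANCE);
    $candidates = array();
    for ($i = $bx - 1; $i <= $bx + 1; $i++) {
        for ($j = $by - 1; $j <= $by + 1; $j++) {
            if (isset($bins[$i][$j])) {
                $candidates = array_merge($candidates, $bins[$i][$j]);
            }
        }
    }
    return $candidates;
}

The exact distance test from the question then runs only against these candidates, so the clusters come out the same as with the brute-force pass while the work per marker stays roughly constant.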

Ants Aasma
Postponing the 10000000 scale was a good idea, but I'm not sure what you mean by calculating the distance outside the loop - I need to check the distance between each pair of markers, and the loop is where that happens. Also, while the bin idea would certainly help the speed, I'm not willing to do it, as it significantly decreases the quality of the clusters.
Chris B
The bins will yield exactly equivalent results to the current algorithm. Binning will avoid having to calculate the distance of markers that cannot possibly be close enough.
Ants Aasma
Ok, I think I understand what you're saying now - even if two close markers are divided by a bin boundary, it won't matter because I can check the adjacent bins. I'll see if I can figure out how to implement this.
Chris B
I was unable to find a way to implement this that was faster than what I have. There's probably a way to do it... :(
Chris B
I'm not sure why you can't get a speedup. I tried a quick implementation in Python. For 10k points I got about a 100x speedup (9.4s vs 95ms). The dataset is 1000 randomly generated locations with 10 points per location, randomly offset by a normal distribution with a stdev of 1/250 of the window; the cluster radius is 1/100 of the window. The result looks like this: http://www.flickr.com/photos/antsaasma/3963640089/
Ants Aasma
For kicks I tried QT clustering on top of the binned lookup structure - still a 10x speedup over the naive method, but significantly better clustering, with 1.35x fewer clusters on average.
Ants Aasma
Would you mind sharing your implementation? It seems there might be a couple of ways to do this, and I wasn't really too sure how to go about it.
Chris B
Here you go: http://gist.github.com/204365 I added some comments to make it easier to understand.
Ants Aasma
A: 

If you can, sort your markers by longitude in the initial search; then, as soon as a marker is more than a marker's width away from the next marker in the sorted list, you know for certain that none of the remaining markers can overlap it, so you can break out of the foreach loop and save yourself a ton of processing time. I have implemented this on my own site and it works very efficiently.

I have some source code in Python here.
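A hypothetical PHP adaptation of that early-exit idea, using the Manhattan-distance test from the question's update (assumes $markers is a zero-indexed array already sorted ascending by 'lng'):

$singleMarkers  = array();
$clusterMarkers = array();

$n    = count($markers);
$used = array_fill(0, $n, false);

for ($i = 0; $i < $n; $i++) {
    if ($used[$i]) {
        continue;
    }
    $cluster = array();
    for ($j = $i + 1; $j < $n; $j++) {
        if ($used[$j]) {
            continue;
        }
        // Sorted by longitude: once the lng gap alone exceeds the threshold,
        // no later marker can possibly be within range - stop scanning.
        if ($markers[$j]['lng'] - $markers[$i]['lng'] > $DISTANCE) {
            break;
        }
        if (abs($markers[$i]['lat'] - $markers[$j]['lat'])
                + abs($markers[$i]['lng'] - $markers[$j]['lng']) < $DISTANCE) {
            $used[$j] = true;
            $cluster[] = $markers[$j];
        }
    }
    if (count($cluster) > 0) {
        $cluster[] = $markers[$i];
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $markers[$i];
    }
}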

Tristen
The problem I've seen with sorting by longitude (or latitude) is this: as I go through the list, I'll get all of the markers in one area, and then it will jump to markers in a completely different area (but with a just slightly higher longitude). Further down the list, I'll get some more markers which should have been in that first area, but weren't included because the longitude was higher than the markers in the second area, but still close enough to have been included in the first area cluster. Hope that made sense.
Chris B
Sounds like your algorithm for comparing whether one marker intersects with another is incorrect; I create a rectangle around each marker point, and test intersection with the next rectangle in the sorted list. As soon as the intersection test fails for a marker in the list, it is impossible for any further marker in the list to intersect, since it is sorted. These calculations are done in pixel space (I've converted the lat/lng to pixels as seen on the map).
Tristen
+1  A: 

So here's what I did - I added two extra columns to the markers (point) table with the Mercator-converted values for the latitude and longitude, using the following functions:

// Wrapped in a class so the self:: references resolve (the class name is illustrative).
class Mercator
{
    public static $offset = 268435456;
    public static $radius = 85445659.44705395; /* $offset / pi(); */

    public static function LonToX($lon)
    {
        return round(self::$offset + self::$radius * $lon * pi() / 180);
    }

    public static function LatToY($lat)
    {
        return round(self::$offset - self::$radius * log((1 + sin($lat * pi() / 180)) / (1 - sin($lat * pi() / 180))) / 2);
    }
}

This way I could get accurately placed clusters. I'm still trying to work out how to avoid using array_pop and cycling through everything every time. So far it's pretty efficient with sub-1000 markers. I'll be posting results for 5K+ and 10K+ markers later.
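For example (my own illustration, not from the answer): since $offset is 2^28, the stored X/Y values are world pixel coordinates at the maximum zoom level of 21, so the pixel distance between two rows $a and $b at any zoom reduces to a bit shift:

$shift  = 21 - $zoom;
$dx     = abs($a['x'] - $b['x']) >> $shift;
$dy     = abs($a['y'] - $b['y']) >> $shift;
$pixels = $dx + $dy; // Manhattan distance in on-screen pixels at $zoom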

Avoiding the pixelDistance() function call and inlining the math instead improves performance by almost half!

David Kobia
Actually - I can't add comments to the original question or to any other answer unless the answer was mine... weird. Anyhow, it looks like you're running a modified version of Tuupola's code (http://github.com/tuupola/php_google_maps). I've profiled what you have so far and it's definitely not efficient when you have thousands of points. I've struggled with various solutions, and it all still points back to pre-clustering - which is an issue if one wants to generate data dynamically.
David Kobia
I'm actually not converting lat and lon values to x and y values - I just cluster them as they are. This means that the clusters will get more skewed as they get further away from the equator, but it's hardly noticeable. I also used the advice above to calculate Manhattan distance (not Euclidean), which helped to speed things up. It did reduce the accuracy of the clusters a bit, but again - it's hardly noticeable. Right now we're running about 5K markers pretty quick, but that number is getting bigger.
Chris B
The Manhattan distance works pretty well too - I've been playing with that. I suppose accuracy is not an issue if you're dealing with a cluster. I'm just wondering what you might have used for $distance based on the zoomLevel.
David Kobia
David, I updated my code to show how I calculate distance now.
Chris B
I've successfully tested with 10K markers by adding a viewport query to restrict the clustering only to the bounds of the viewport. Next stop 20K
David Kobia