views:

309

answers:

7

I have up to 10,000 randomly positioned points in a space and I need to be able to tell which one the cursor is closest to at any given time. To add some context, the points are in the form of a vector drawing, so they can be constantly and quickly added and removed by the user, and they may also be unevenly distributed across the canvas space.

I am therefore trying to find the most efficient data structure for storing and querying these points. I would like to keep this question language agnostic if possible.

+2  A: 

Are the points uniformly distributed?

You could build a quad-tree up to a certain depth, say, 8. At the top you have a tree node that divides the screen into four quadrants. Store at each node:

  • The top left and the bottom right coordinate
  • Pointers to four child nodes, which divide the node into four quadrants

At the leaf nodes, store a list of the points that fall within that region; you can search that list linearly.

If you need more granularity, build the quad-tree to a greater depth.
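As a rough illustration of the idea above, here is a minimal sketch of a fixed-depth quad-tree with a nearest-point query. The class name `QuadTreeSketch` and its methods are hypothetical, the tree is built eagerly (a depth of 8 means 65,536 leaves), and it assumes every point lies inside the root rectangle. It is one possible reading of the description, not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fixed-depth quad-tree; class and method names are illustrative.
// Assumes every inserted point lies inside the root rectangle.
class QuadTreeSketch {
    final double x0, y0, x1, y1;      // top-left and bottom-right corners
    final QuadTreeSketch[] children;  // four quadrants, or null at a leaf
    final List<double[]> points;      // points held by a leaf, or null otherwise

    QuadTreeSketch(double x0, double y0, double x1, double y1, int depth) {
        this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        if (depth == 0) {
            children = null;
            points = new ArrayList<>();
        } else {
            double mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
            children = new QuadTreeSketch[] {
                new QuadTreeSketch(x0, y0, mx, my, depth - 1),
                new QuadTreeSketch(mx, y0, x1, my, depth - 1),
                new QuadTreeSketch(x0, my, mx, y1, depth - 1),
                new QuadTreeSketch(mx, my, x1, y1, depth - 1)
            };
            points = null;
        }
    }

    // Pick the quadrant that contains (x, y).
    private QuadTreeSketch child(double x, double y) {
        double mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
        return children[(x < mx ? 0 : 1) + (y < my ? 0 : 2)];
    }

    void insert(double x, double y) {
        if (children == null) points.add(new double[] {x, y});
        else child(x, y).insert(x, y);
    }

    private static double dist2(double[] p, double x, double y) {
        double dx = p[0] - x, dy = p[1] - y;
        return dx * dx + dy * dy;
    }

    // Squared distance from (x, y) to this node's rectangle (0 if inside).
    private double boxDist2(double x, double y) {
        double dx = Math.max(0, Math.max(x0 - x, x - x1));
        double dy = Math.max(0, Math.max(y0 - y, y - y1));
        return dx * dx + dy * dy;
    }

    // Returns the nearest stored point as {x, y}, or null if the tree is empty.
    double[] nearest(double x, double y) {
        return nearest(x, y, null);
    }

    private double[] nearest(double x, double y, double[] best) {
        // Skip regions that cannot contain anything closer than the current best.
        if (best != null && boxDist2(x, y) >= dist2(best, x, y)) return best;
        if (children == null) {
            for (double[] p : points)
                if (best == null || dist2(p, x, y) < dist2(best, x, y)) best = p;
            return best;
        }
        // Visit the quadrant under the cursor first so later quadrants prune well.
        QuadTreeSketch first = child(x, y);
        best = first.nearest(x, y, best);
        for (QuadTreeSketch c : children)
            if (c != first) best = c.nearest(x, y, best);
        return best;
    }

    public static void main(String[] args) {
        QuadTreeSketch tree = new QuadTreeSketch(0, 0, 512, 512, 4);
        tree.insert(10, 10);
        tree.insert(200, 50);
        tree.insert(300, 300);
        double[] hit = tree.nearest(190, 60);
        System.out.println(hit[0] + "," + hit[1]);
    }
}
```

Visiting the quadrant under the cursor first usually yields a good candidate early, which lets the rectangle test discard most of the remaining quadrants.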

xcut
This sounds like the kind of thing I was thinking of. The points are not uniformly distributed, however, and the canvas size is also variable, not that this rules out this method.
Tom
+6  A: 

The most efficient data structure would be a kd-tree.
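For readers unfamiliar with the structure, here is a minimal 2-D kd-tree sketch with insertion and nearest-neighbour search. The class name `KdTreeSketch` and its methods are hypothetical. Note the assumption that points are inserted one at a time with no rebalancing, so an unlucky insertion order (for example, already sorted points) can degrade the tree towards a linked list.

```java
// Hypothetical 2-D kd-tree sketch: alternates splitting on x and y by depth.
// No rebalancing is performed; names are illustrative only.
class KdTreeSketch {
    private static final class Node {
        final double x, y;
        Node left, right;
        Node(double x, double y) { this.x = x; this.y = y; }
    }

    private Node root;

    void insert(double x, double y) {
        root = insert(root, x, y, 0);
    }

    private Node insert(Node n, double x, double y, int depth) {
        if (n == null) return new Node(x, y);
        // Even depths compare x, odd depths compare y.
        boolean goLeft = (depth % 2 == 0) ? x < n.x : y < n.y;
        if (goLeft) n.left = insert(n.left, x, y, depth + 1);
        else n.right = insert(n.right, x, y, depth + 1);
        return n;
    }

    private static double dist2(Node n, double qx, double qy) {
        double dx = n.x - qx, dy = n.y - qy;
        return dx * dx + dy * dy;
    }

    // Returns the nearest stored point as {x, y}, or null if the tree is empty.
    double[] nearest(double qx, double qy) {
        Node best = nearest(root, qx, qy, 0, null);
        return best == null ? null : new double[] {best.x, best.y};
    }

    private Node nearest(Node n, double qx, double qy, int depth, Node best) {
        if (n == null) return best;
        if (best == null || dist2(n, qx, qy) < dist2(best, qx, qy)) best = n;
        // Signed distance from the query to this node's splitting line.
        double diff = (depth % 2 == 0) ? qx - n.x : qy - n.y;
        Node near = diff < 0 ? n.left : n.right;
        Node far = diff < 0 ? n.right : n.left;
        best = nearest(near, qx, qy, depth + 1, best);
        // Only cross the splitting line if something closer could lie beyond it.
        if (diff * diff < dist2(best, qx, qy)) best = nearest(far, qx, qy, depth + 1, best);
        return best;
    }

    public static void main(String[] args) {
        KdTreeSketch tree = new KdTreeSketch();
        double[][] pts = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        for (double[] p : pts) tree.insert(p[0], p[1]);
        double[] hit = tree.nearest(9, 2);
        System.out.println(hit[0] + "," + hit[1]);
    }
}
```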

DiggerMeUp
Whoever voted this down could at least give a reason.
DiggerMeUp
I wonder why this is voted up, when OP wrote: "so they can be constantly and quickly changed by the user". KD-tree balancing would quickly become a nightmare.
MaR
@MaR I agree that the need for rebalancing could potentially be an issue. I think it is lessened here because: 1) If the new point's position is still within the same region, the tree would not need changing (each node would just need to store the original point and the current one). 2) Only one point is altered at a time, so there would be one removal and one insertion. 3) The tree would only need rebalancing if the vector drawing were changed into something completely different and the performance of the nearest-neighbour search degraded too much. 4) It is less of an issue in 2D. This would need testing.
DiggerMeUp
@MaR - the other point worth mentioning is Tom added the "so they can be constantly and quickly changed by the user" as an edit after I posted my answer although I still think a variant around KD-Tree may be best suited.
DiggerMeUp
Probably a dynamic variant of the kd-tree is the best choice. See e.g. http://www.daimi.au.dk/~large/Papers/bkdsstd03.pdf
martinus
+1  A: 

It depends on the frequency of updates and queries. For fast queries and slow updates, a quadtree (which is closely related to a kd-tree for 2-D) would probably be best. Quadtrees also handle non-uniform points very well.

If you have a low resolution you could consider using a raw array of width x height of pre-computed values.
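A minimal sketch of such a pre-computed table, assuming integer pixel coordinates and a hypothetical class name `NearestLookupTable`: it stores, for every pixel, the index of the nearest point. Building it this way costs width x height x n distance checks and the table has to be rebuilt whenever a point changes, so it only suits small canvases or rarely edited drawings.

```java
// Hypothetical sketch: one table entry per pixel holding the index of the
// nearest point. Names are illustrative; the table is built by brute force.
class NearestLookupTable {
    private final int width;
    private final int[] nearestIndex; // row-major; -1 when there are no points

    NearestLookupTable(int width, int height, int[][] points) {
        this.width = width;
        this.nearestIndex = new int[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int best = -1;
                long bestDist = Long.MAX_VALUE;
                for (int i = 0; i < points.length; i++) {
                    long dx = points[i][0] - x, dy = points[i][1] - y;
                    long d = dx * dx + dy * dy; // squared distance is enough to compare
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                nearestIndex[y * width + x] = best;
            }
        }
    }

    // Constant-time query: index into the original points array.
    int nearestTo(int x, int y) {
        return nearestIndex[y * width + x];
    }

    public static void main(String[] args) {
        int[][] points = {{1, 1}, {8, 8}};
        NearestLookupTable table = new NearestLookupTable(10, 10, points);
        System.out.println(table.nearestTo(4, 4));
    }
}
```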

If you have very few points or fast updates, a simple array is enough, or maybe a simple partitioning (which goes towards the quadtree).

So the answer depends on the parameters of your dynamics. Also, I would add that nowadays the algorithm isn't everything; making it use multiple processors or CUDA can give a huge boost.

Wernight
+4  A: 

After the Update to the Question

  1. Use two red-black tree or skip list maps. Both are compact self-balancing data structures giving you O(log n) time for search, insert and delete operations. One map will use the X-coordinate of every point as a key and the point itself as a value, and the other will use the Y-coordinate as a key and the point itself as a value.

  2. As a trade-off, I suggest initially restricting the search area to a square around the cursor. For a perfect match, the side of the square should equal the diameter of your "sensitivity circle" around the cursor, i.e. if you're interested only in a nearest neighbour within a 10-pixel radius of the cursor, the side of the square needs to be 20px. Alternatively, if you're after the nearest neighbour regardless of proximity, you might try finding the boundary dynamically by evaluating the floor and ceiling relative to the cursor.

  3. Then retrieve two subsets of points from the maps that are within the boundaries, and intersect them to keep only the points present in both subsets.

  4. Loop through the result, calculate the proximity to each point (dx^2+dy^2; avoid the square root since you're not interested in the actual distance, just proximity), and find the nearest neighbour.

  5. Take the square root of the proximity figure to get the distance to the nearest neighbour and check whether it is greater than the radius of the "sensitivity circle"; if it is, there are no points within the circle.

  6. I suggest benchmarking every approach; it's too easy to go over the top with optimisations. On my modest hardware (Core 2 Duo), a naïve single-threaded nearest-neighbour search over 10K points, repeated a thousand times, takes 350 milliseconds in Java. As long as the overall UI reaction time is under 100 milliseconds it will seem instant to the user; with that in mind, even a naïve search might give you a sufficiently fast response.

Generic Solution

The most efficient data structure depends on the algorithm you're planning to use, the time-space trade-off and the expected relative distribution of points:

  • If space is not an issue the most efficient way may be to pre-calculate the nearest neighbour for each point on the screen and then store nearest neighbour unique id in a two-dimensional array representing the screen.
  • If time is not an issue, storing 10K points in a simple array and doing a naïve search every time, i.e. looping through each point and calculating the distance, may be a good, simple and easy-to-maintain option.
  • For a number of trade-offs between the two, here is a good presentation on various Nearest Neighbour Search options available: http://dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
  • A bunch of good detailed materials for various Nearest Neighbour Search algorithms: http://simsearch.yury.name/tutorial.html, just pick one that suits your needs best.

So it's really impossible to evaluate the data structure in isolation from the algorithm, which in turn is hard to evaluate without a good idea of the task's constraints and priorities.

Sample Java Implementation

import java.util.*;
import java.util.concurrent.ConcurrentSkipListMap;

class Test
{

  public static void main (String[] args)
  {

   Drawing naive = new NaiveDrawing();
   Drawing skip  = new SkipListDrawing();

   long start;

   start = System.currentTimeMillis();
   testInsert(naive);
   System.out.println("Naive insert: "+(System.currentTimeMillis() - start)+"ms");
   start = System.currentTimeMillis();
   testSearch(naive);
   System.out.println("Naive search: "+(System.currentTimeMillis() - start)+"ms");


   start = System.currentTimeMillis();
   testInsert(skip);
   System.out.println("Skip List insert: "+(System.currentTimeMillis() - start)+"ms");
   start = System.currentTimeMillis();
   testSearch(skip);
   System.out.println("Skip List search: "+(System.currentTimeMillis() - start)+"ms");

  }

  public static void testInsert(Drawing d)
  {
   Random r = new Random();
   for (int i=0;i<100000;i++)
   d.addPoint(new Point(r.nextInt(4096),r.nextInt(2048)));
  }

  public static void testSearch(Drawing d)
  {
   Point cursor;
   Random r = new Random();
   for (int i=0;i<1000;i++)
   {
    cursor = new Point(r.nextInt(4096),r.nextInt(2048));
    d.getNearestFrom(cursor,10);
   }
  }


}

// A simple point class
class Point
{
 public Point (int x, int y)
 {
  this.x = x;
  this.y = y;
 }
 public final int x,y;

 public String toString()
 {
  return "["+x+","+y+"]";
 }
}

// Interface will make the benchmarking easier
interface Drawing
{
 void addPoint (Point p);
 Set<Point> getNearestFrom (Point source,int radius);

}


class SkipListDrawing implements Drawing
{

 // Helper class to store an index of point by a single coordinate
 // Unlike standard Map it's capable of storing several points against the same coordinate, i.e.
 // [10,15] [10,40] [10,49] all can be stored against X-coordinate and retrieved later
 // This is achieved by storing a list of points against the key, as opposed to storing just a point.
 private class Index
 {
  final private NavigableMap<Integer,List<Point>> index = new ConcurrentSkipListMap <Integer,List<Point>> ();

  void add (Point p,int indexKey)
  {
   List<Point> list = index.get(indexKey);
   if (list==null)
   {
    list = new ArrayList<Point>();
    index.put(indexKey,list);
   }
   list.add(p);
  }

  HashSet<Point> get (int fromKey,int toKey)
  {
   final HashSet<Point> result = new HashSet<Point> ();

   // Use NavigableMap.subMap to quickly retrieve all entries matching
   // search boundaries, then flatten resulting lists of points into
   // a single HashSet of points.
   for (List<Point> s: index.subMap(fromKey,true,toKey,true).values())
    for (Point p: s)
     result.add(p);

   return result;
  }

 }

 // Store each point indexed by its X and Y coordinates in two separate indices
 final private Index xIndex = new Index();
 final private Index yIndex = new Index();

 public void addPoint (Point p)
 {
  xIndex.add(p,p.x);
  yIndex.add(p,p.y);
 }


 public Set<Point> getNearestFrom (Point origin,int radius)
 {


    final Set<Point> searchSpace;
    // search space is going to contain only the points that are within
    // "sensitivity square". First get all points where X coordinate
    // is within the given range.
    searchSpace = xIndex.get(origin.x-radius,origin.x+radius);

    // Then get all points where Y is within the range, and store
    // within searchSpace the intersection of two sets, i.e. only
    // points where both X and Y are within the range.
    searchSpace.retainAll(yIndex.get(origin.y-radius,origin.y+radius));


    // Loop through the search space and calculate proximity to each point.
    // Don't take the square root as it's expensive and really unnecessary
    // at this stage.
    //
    // Keep track of the list of nearest points in case there are several
    // at the same distance.
    int dist,dx,dy, minDist = Integer.MAX_VALUE;

    Set<Point> nearest = new HashSet<Point>();

    for (Point p: searchSpace)
    {
     dx=p.x-origin.x;
     dy=p.y-origin.y;
     dist=dx*dx+dy*dy;

     if (dist<minDist)
     {
      minDist=dist;
      nearest.clear();
      nearest.add(p);
     }
     else if (dist==minDist)
     {
      nearest.add(p);
     }
    }

    // OK, now we have the list of nearest points; it might be empty.
    // But let's check whether they are still beyond the sensitivity radius:
    // the search area we have evaluated was a square with a side equal to
    // the diameter of the actual circle. If the points we've found are
    // in the corners of the square area they might be outside the circle.
    // Let's see what the distance is; if it is greater than the radius
    // then we don't have a single point within the proximity boundaries.
    if (Math.sqrt(minDist) > radius) nearest.clear();
    return nearest;
   }
}

// Naive approach: just loop through every point and see if it's nearest.
class NaiveDrawing implements Drawing
{
 final private List<Point> points = new ArrayList<Point> ();

 public void addPoint (Point p)
 {
  points.add(p);
 }

 public Set<Point> getNearestFrom (Point origin,int radius)
 {

    int prevDist = Integer.MAX_VALUE;
    int dist;

    Set<Point> nearest = Collections.emptySet();

    for (Point p: points)
    {
     int dx = p.x-origin.x;
     int dy = p.y-origin.y;

     dist = dx * dx + dy * dy;
     if (dist < prevDist)
     {
      prevDist = dist;
      nearest  = new HashSet<Point>();
      nearest.add(p);
     }
     else if (dist==prevDist) nearest.add(p);
    }

    if (Math.sqrt(prevDist) > radius) nearest = Collections.emptySet();

    return nearest;
   }
}
Totophil
Wouldn't looping through the array, checking whether the coordinates are within the sensitivity square, be almost as intensive as a distance calculation? Four OR statements per point?
Tom
The distance calculation includes two multiplications, an addition and, most expensive of all, a square root (which you can avoid if you're interested just in the degree of closeness). The comparison can be up to four ANDs per point, but most of the time you'll end up with fewer than that (since if the first fails, the rest won't get evaluated, and so on). You can also combine this "sensitivity" approach with some sort of tree index, depending on what needs to be done more frequently: re-shuffling of points or proximity checks.
Totophil
I am going to give the skip list approach a go; your method seems clear to follow, thanks.
Tom
Tom, I just implemented this in Java to give it a try myself, using the standard Java ConcurrentSkipListMap. The same test (a thousand searches within 10K points) takes around 60-70 milliseconds, i.e. a 5 times improvement. Would you be interested in the code?
Totophil
Totophil, yep, I would be interested in seeing it if you are offering; performance is really important. I guess I would also have to store a separate structure for which points are joined with lines etc. to create the drawing.
Tom
+6  A: 
Matthijs Wessels
I have added more context to the question: the points take the form of a vector drawing. Would this solution still be appropriate?
Tom
I've deleted my previous comment and added update time to my answer. Updating the data structure will take O(n) time I think. I still think that will be acceptable for a response to a user interaction.
Matthijs Wessels
There are algorithms for incremental update of Voronoi diagrams which take only O(log n) time per update http://www.springerlink.com/content/p8377h68j82l6860.
Keith Randall
I cannot access the paper you linked to at this moment (I'll have to do it at my university), but I think that doesn't cover just any update. I think it gives an average update time of O(log n) when constructing the diagram, resulting in O(n log n) total construction time. For normal updates this does not hold. Take for example a set of n points that all lie on a circle, add and remove one point in the middle and it will always take O(n) time because O(n) line segments have to be added/removed.
Matthijs Wessels
A: 

You haven't specified the dimensions of your points, but if it's a 2D line drawing then a bitmap bucket (a 2D array of lists of points in a region, where you scan the buckets corresponding to and near the cursor) can perform very well. Most systems will happily handle bitmap buckets of the order of 100x100 to 1000x1000, the small end of which would put a mean of one point per bucket. Although asymptotic performance is O(N), real-world performance is typically very good. Moving individual points between buckets can be fast; moving objects around can also be made fast if you put the objects into the buckets rather than the points (so a polygon of 12 points would be referenced by 12 buckets; moving it costs 12 times the insertion and removal cost of the bucket list; looking up the bucket is constant time in the 2D array). The major cost is reorganising everything if the canvas size grows in many small jumps.
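A minimal sketch of the bucket idea, assuming a fixed canvas size, integer coordinates and a hypothetical class name `BucketGrid`. It stores points (not whole objects) per bucket and scans only the buckets that overlap the square around the cursor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bucket grid over a fixed-size canvas; names are illustrative.
class BucketGrid {
    private final int cellSize, cols, rows;
    private final List<int[]>[] buckets; // row-major array of point lists

    @SuppressWarnings("unchecked")
    BucketGrid(int width, int height, int cellSize) {
        this.cellSize = cellSize;
        this.cols = (width + cellSize - 1) / cellSize;
        this.rows = (height + cellSize - 1) / cellSize;
        this.buckets = new List[cols * rows];
        for (int i = 0; i < buckets.length; i++) buckets[i] = new ArrayList<>();
    }

    // Column/row of a coordinate, clamped to the canvas.
    private int col(int x) { return Math.max(0, Math.min(cols - 1, x / cellSize)); }
    private int row(int y) { return Math.max(0, Math.min(rows - 1, y / cellSize)); }

    void add(int x, int y) {
        buckets[row(y) * cols + col(x)].add(new int[] {x, y});
    }

    boolean remove(int x, int y) {
        return buckets[row(y) * cols + col(x)].removeIf(p -> p[0] == x && p[1] == y);
    }

    // Nearest point within `radius` of the cursor as {x, y}, or null if none.
    int[] nearestWithin(int cx, int cy, int radius) {
        long best = (long) radius * radius; // compare squared distances, no sqrt
        int[] result = null;
        for (int r = row(cy - radius); r <= row(cy + radius); r++) {
            for (int c = col(cx - radius); c <= col(cx + radius); c++) {
                for (int[] p : buckets[r * cols + c]) {
                    long dx = p[0] - cx, dy = p[1] - cy;
                    long d = dx * dx + dy * dy;
                    if (d <= best) { best = d; result = p; }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BucketGrid grid = new BucketGrid(100, 100, 10);
        grid.add(5, 5);
        grid.add(50, 50);
        grid.add(52, 49);
        int[] hit = grid.nearestWithin(51, 50, 5);
        System.out.println(hit[0] + "," + hit[1]);
    }
}
```

Adding or removing a point touches a single bucket, which is what keeps edits cheap; the cell size controls how many points a query has to scan.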

Pete Kirkham
A: 

If it is in 2D, you can create a virtual grid covering the whole space (its width and height depend on the actual space your points occupy) and assign each 2D point to the cell that contains it. Each cell then becomes a bucket in a hashtable.
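A minimal sketch of the hashtable variant, assuming integer coordinates and a hypothetical class name `HashGrid`. Because cells are created on demand and keyed by their grid coordinates, no canvas size has to be fixed up front. The query below only inspects the 3x3 block of cells around the cursor, so it reports the nearest point only when that point is within one cell size of the cursor; that limit is a simplification chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hash-backed grid; cells are created lazily. Names are illustrative.
class HashGrid {
    private final int cellSize;
    private final Map<Long, List<int[]>> cells = new HashMap<>();

    HashGrid(int cellSize) { this.cellSize = cellSize; }

    // Pack the cell column and row of (x, y) into one long key.
    private long key(int x, int y) {
        long cx = Math.floorDiv(x, cellSize), cy = Math.floorDiv(y, cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    void add(int x, int y) {
        cells.computeIfAbsent(key(x, y), k -> new ArrayList<>()).add(new int[] {x, y});
    }

    boolean remove(int x, int y) {
        List<int[]> bucket = cells.get(key(x, y));
        if (bucket == null) return false;
        boolean removed = bucket.removeIf(p -> p[0] == x && p[1] == y);
        if (bucket.isEmpty()) cells.remove(key(x, y)); // drop empty cells
        return removed;
    }

    // Nearest point no farther than cellSize from the cursor, or null if none.
    int[] nearestWithinCellSize(int cx, int cy) {
        long best = (long) cellSize * cellSize;
        int[] result = null;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                List<int[]> bucket = cells.get(key(cx + dx * cellSize, cy + dy * cellSize));
                if (bucket == null) continue;
                for (int[] p : bucket) {
                    long ex = p[0] - cx, ey = p[1] - cy;
                    long d = ex * ex + ey * ey;
                    if (d <= best) { best = d; result = p; }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HashGrid grid = new HashGrid(16);
        grid.add(-5, -5);   // negative coordinates are fine: no fixed canvas
        grid.add(3, 4);
        grid.add(100, 100);
        int[] hit = grid.nearestWithinCellSize(0, 0);
        System.out.println(hit[0] + "," + hit[1]);
    }
}
```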

D_K