views:

1243

answers:

12

A point in 3-d is defined by (x,y,z). Distance d between any two points (X,Y,Z) and (x,y,z) is d= Sqrt[(X-x)^2 + (Y-y)^2 + (Z-z)^2]. Now there are a million entries in a file, each entry is some point in space, in no specific order. Given any point (a,b,c) find the nearest 10 points to it. How would you store the million points and how would you retrieve those 10 points from that data structure.

+6  A: 

This isn't a homework question, is it? ;-)

My thought: iterate over all points and put them into a min heap or bounded priority queue, keyed by distance from the target.

David Zaslavsky
its an interview question
Kazoom
sure, but it is not known what the target is. :)
Unreason
+15  A: 

If the million entries are already in a file, there's no need to load them all into a data structure in memory. Just keep an array with the top-ten points found so far, and scan over the million points, updating your top-ten list as you go.

This is O(n) in the number of points.

Will
This will work well, but the array is not the most efficient data store, because you have to check over it each step, or keep it sorted, which can be a hassle. David's answer about a min-heap does that stuff for you, but is otherwise the same solution. When the user only wants 10 points, these concerns are negligible, but when the user suddenly wants the nearest 100 pts, you will be in trouble.
Karl
@ Karl: The question specifies 10 points. I think including this detail is deliberate on the part of the asker. So, Will describes a very good solution for what was asked.
Nixuz
@Karl, it is often surprising how good the compiler and the CPU is at optimizing the dumbest of loops to beat the cleverest of algorithms. Never underestimate the speedup to be gained when a loop can run in on-chip ram.
Hightechrider
Million entries are not already in the file - you can choose how to store them in the file. :)This choice on how to store it implies that you can also precompute any accompanying indexing structure. Kd-tree will win as it will not have to read the whole file at all < O(n).
Unreason
I've posted implementation of your answer http://stackoverflow.com/questions/2486093/millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point/2486341#2486341 (though I use heap instead of linear search and it is completely unnecessary for the task)
J.F. Sebastian
Karl is right, a heap is better. *Use a heap.*
FogleBird
@FogleBird: For 10 points linear search *is* better than heap (1.11-1.12 vs. 1.15-1.18 seconds for the input I've tested). For N=10 a constant factor in algorithm does matter therefore O(N) algorithm can be faster than O(log(N)) algorithm.
J.F. Sebastian
1.11 vs 1.15 is pretty slim. I'd code it with a heap anyway in case I ever wanted to do more than 10. Of course I don't know what the OP wants to do with it.
FogleBird
+2  A: 

Straightforward algorithm:

Store the points as a list of tuples, and scan over the points, computing the distance, and keeping a 'closest' list.

More creative:

Group points into regions (such as the cube described by "0,0,0" to "50,50,50", or "0,0,0" to "-20,-20,-20"), so you can "index" into them from the target point. Check which cube the target lies in, and only search through the points in that cube. If there are less than 10 points in that cube, check the "neighboring" cubes, and so on.

On further thought, this isn't a very good algorithm: if your target point is closer to the wall of a cube than 10 points, then you'll have to search into the neighboring cube as well.

I'd go with the kd-tree approach and find the closest, then remove (or mark) that closest node, and re-search for the new closest node. Rinse and repeat.

Jeff Meatball Yang
+1  A: 

basically a combination of the first two answer above me. since the points are in a file there's no need to keep them in memory. Instead of an array, or a min heap, I would use a max heap, since you only want to check for distances less than the 10th closest point. For an array, you would need to compare each newly calculated distance with all 10 of the distances you kept. For a min heap, you have to perform 3 comparisons with every newly calculated distance. With a max heap, you only perform 1 comparison when the newly calculated distance is greater than the root node.

Yiling
+12  A: 

You could store the points in a k-dimensional tree (kd-tree). Kd-trees are optimized for nearest-neighbor searches (finding the n points closest to a given point).

mipadi
I think an octree is called for here.
Gabe
The complexity required to build a K-d tree is going to be higher than the complexity required to do a linear search for the 10 closet points. The real power of a k-d tree comes when you are going to do many queries on a point set.
Nixuz
@gabe: One big advantage of a kd-tree is that while it'll take some memory overhead to build, once you have it a left-balanced kd-tree takes no more memory than the original unstructured list of point. An octree will definitely need at least some memory overhead.
Boojum
kd-tree can be slower in real life than the brute-force approach http://stackoverflow.com/questions/2486093/millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point/2486341#2486341
J.F. Sebastian
This is the answer I would give in an interview. It's not rare for interviewers to use less-than-precise language, and reading between the lines this seems to be most likely what they're looking for. In fact if I were the interviewer, and someone gave the answer "I would store the points in any old order, and do a linear scan to find the 10 points" and justified that answer based on my imprecise wording, I would be pretty unimpressed.
Jason Orendorff
@ Jason Orendorff: I'd definitely discuss using a kd-tree for such a problem in a technical interview; however, I would also explain why for the specific problem given, the simpler, linear search method will not only be asymptoticly faster, but will run faster too. This will show a deeper understanding of the complexity of algorithms, knowledge of data structures, and the ability to consider different solutions to a problem.
Nixuz
A: 

Calculate the distance for each of them, and do Select(1..10, n) in O(n) time. That would the naive algorithm I guess.

Rubys
+4  A: 

This question is essentially testing your knowledge and/or intuition of space partitioning algorithms. I'd say that storing the data in an octree is your best bet. It's commonly used in 3d engines that handle just this kind of problem (storing millions of vertices, ray tracing, finding collisions, etc.). The lookup time will be on the order of log(n) in the worst case scenario (I believe).

Kai
+21  A: 

Million points is a small number. The most straightforward approach works here (code based on KDTree is slower (for querying only one point)).

Brute-force approach (time ~1 second)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point = numpy.random.uniform(0, 100, NDIM) # choose random point
print 'point:', point
d = ((a-point)**2).sum(axis=1)  # compute distances
ndx = d.argsort() # indirect sort 

# print 10 nearest points to the chosen one
import pprint
pprint.pprint(zip(a[ndx[:10]], d[ndx[:10]]))

Run it:

$ time python nearest.py 
point: [ 69.06310224   2.23409409  50.41979143]
[(array([ 69.,   2.,  50.]), 0.23500677815852947),
 (array([ 69.,   2.,  51.]), 0.39542392750839772),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  51.]), 0.9272357402197513),
 (array([ 70.,   2.,  50.]), 1.1088022980015722),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   3.,  51.]), 1.801031260062794),
 (array([ 69.,   1.,  51.]), 1.8636121147970444)]

real    0m1.122s
user    0m1.010s
sys 0m0.120s

Here's the script that generates million 3D points:

#!/usr/bin/env python
import random
for _ in xrange(10**6):
    print ' '.join(str(random.randrange(100)) for _ in range(3))

Output:

$ head million_3D_points.txt

18 56 26
19 35 74
47 43 71
82 63 28
43 82 0
34 40 16
75 85 69
88 58 3
0 63 90
81 78 98

You could use that code to test more complex data structures and algorithms (for example, whether they actually consume less memory or faster then the above simplest approach). It is worth noting that at the moment it is the only answer that contains working code.

Solution based on KDTree (time ~1.4 seconds)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point =  [ 69.06310224,   2.23409409,  50.41979143] # use the same point as above
print 'point:', point


from scipy.spatial import KDTree

# find 10 nearest points
tree = KDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query([point], k=10)

# print 10 nearest points to the chosen one
print a[ndx]

Run it:

$ time python nearest_kdtree.py  

point: [69.063102240000006, 2.2340940900000001, 50.419791429999997]
[[[ 69.   2.  50.]
  [ 69.   2.  51.]
  [ 69.   3.  50.]
  [ 69.   3.  50.]
  [ 69.   3.  51.]
  [ 70.   2.  50.]
  [ 70.   2.  51.]
  [ 70.   2.  51.]
  [ 70.   3.  51.]
  [ 69.   1.  51.]]]

real    0m1.359s
user    0m1.280s
sys 0m0.080s

Partial sort in C++ (time ~1.1 seconds)

// $ g++ nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>
#include <iostream>
#include <vector>

#include <boost/lambda/lambda.hpp>  // _1
#include <boost/lambda/bind.hpp>    // bind()
#include <boost/tuple/tuple_io.hpp>

namespace {
  typedef double coord_t;
  typedef boost::tuple<coord_t,coord_t,coord_t> point_t;

  coord_t distance_sq(const point_t& a, const point_t& b) { // or boost::geometry::distance
    coord_t x = a.get<0>() - b.get<0>();
    coord_t y = a.get<1>() - b.get<1>();
    coord_t z = a.get<2>() - b.get<2>();
    return x*x + y*y + z*z;
  }
}

int main() {
  using namespace std;
  using namespace boost::lambda; // _1, _2, bind()

  // read array from stdin
  vector<point_t> points;
  cin.exceptions(ios::badbit); // throw exception on bad input
  while(cin) {
    coord_t x,y,z;
    cin >> x >> y >> z;    
    points.push_back(boost::make_tuple(x,y,z));
  }

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;  // 1.14s

  // find 10 nearest points using partial_sort() 
  // Complexity: O(N)*log(m) comparisons (O(N)*log(N) worst case for the implementation)
  const size_t m = 10;
  partial_sort(points.begin(), points.begin() + m, points.end(), 
               bind(less<coord_t>(), // compare by distance to the point
                    bind(distance_sq, _1, point), 
                    bind(distance_sq, _2, point)));
  for_each(points.begin(), points.begin() + m, cout << _1 << "\n"); // 1.16s
}

Run it:

g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
point: (69.0631 2.23409 50.4198)
(69 2 50)
(69 2 51)
(69 3 50)
(69 3 50)
(69 3 51)
(70 2 50)
(70 2 51)
(70 2 51)
(70 3 51)
(69 1 51)

real    0m1.152s
user    0m1.140s
sys 0m0.010s

Priority Queue in C++ (time ~1.2 seconds)

#include <algorithm>           // make_heap
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/range.hpp>     // boost::begin(), boost::end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  template<class T>
  class less_distance : public std::binary_function<T, T, bool> {
    const T& point;
  public:
    less_distance(const T& reference_point) : point(reference_point) {}

    bool operator () (const T& a, const T& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours+1];

  // populate `points`
  for (size_t i = 0; getpoint(cin, points[i]) && i < nneighbours; ++i)
    ;

  less_distance<point_t> less_distance_point(point);
  make_heap  (boost::begin(points), boost::end(points), less_distance_point);

  // Complexity: O(N*log(m))
  while(getpoint(cin, points[nneighbours])) {
    // add points[-1] to the heap; O(log(m))
    push_heap(boost::begin(points), boost::end(points), less_distance_point); 
    // remove (move to last position) the most distant from the
    // `point` point; O(log(m))
    pop_heap (boost::begin(points), boost::end(points), less_distance_point);
  }

  // print results
  push_heap  (boost::begin(points), boost::end(points), less_distance_point);
  //   O(m*log(m))
  sort_heap  (boost::begin(points), boost::end(points), less_distance_point);
  for (size_t i = 0; i < nneighbours; ++i) {
    cout << points[i] << ' ' << distance_sq(points[i], point) << '\n';  
  }
}

Run it:

$ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )

point: (69.0631 2.23409 50.4198)
(69 2 50) 0.235007
(69 2 51) 0.395424
(69 3 50) 0.766819
(69 3 50) 0.766819
(69 3 51) 0.927236
(70 2 50) 1.1088
(70 2 51) 1.26922
(70 2 51) 1.26922
(70 3 51) 1.80103
(69 1 51) 1.86361

real    0m1.174s
user    0m1.180s
sys 0m0.000s

Linear Search -based approach (time ~1.15 seconds)

// $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>           // sort
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/foreach.hpp>
#include <boost/range.hpp>     // begin(), end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

#define foreach BOOST_FOREACH

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b);

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out);    

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  class less_distance : public std::binary_function<point_t, point_t, bool> {
    const point_t& point;
  public:
    explicit less_distance(const point_t& reference_point) 
        : point(reference_point) {}
    bool operator () (const point_t& a, const point_t& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;
  less_distance nearer(point);

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours];

  // populate `points`
  foreach (point_t& p, points)
    if (! getpoint(cin, p))
      break;

  // Complexity: O(N*m)
  point_t current_point;
  while(cin) {
    getpoint(cin, current_point); //NOTE: `cin` fails after the last
                                  //point, so one can't lift it up to
                                  //the while condition

    // move to the last position the most distant from the
    // `point` point; O(m)
    foreach (point_t& p, points)
      if (nearer(current_point, p)) 
        // found point that is nearer to the `point` 

        //NOTE: could use insert (on sorted sequence) & break instead
        //of swap but in that case it might be better to use
        //heap-based algorithm altogether
        std::swap(current_point, p);
  }

  // print results;  O(m*log(m))
  sort(boost::begin(points), boost::end(points), nearer);
  foreach (point_t p, points)
    cout << p << ' ' << distance_sq(p, point) << '\n';  
}

namespace {
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }
}

Measurements shows that most of the time is spent reading array from the file, actual computations take on order of magnitude less time.

J.F. Sebastian
Nice write up. To offset for file read I've run your python implementations with loop around the search that execute 100 times (each time looking around a different point and building the kd-tree only once).The bruteforce still won. Made me scratch my head. But then I examined your leafsize and you have an error there - you are setting the leafsize to 1000001, and that'll not perform well.After setting leafsize to 10, kd won (35s to 70s for 100 points, with most of 35s spent on building the tree and 100 retrievals of 10 points taking a second).
Unreason
So as a conclusion, if you can precompute the kd-tree it will win over brute force by orders of magnitude (not to mention that for really large data sets, if you have a tree you will not have to read all the data in memory).
Unreason
@goran: if I set leafsize to 10 then it takes ~10 seconds (instead of 1 second) to query one point. I agree if the task is to query multiple (>10) points then kd-tree should win.
J.F. Sebastian
@Unreason: pririty queue- and linear search- based implementations above do NOT read read all the data in memory.
J.F. Sebastian
@J.F.Sebastian, sorry I was imprecise - they have to scan all the data (read it in memory). I agree that they don't have to keep it in memory. The kd tree approach (once it has a kd index will have to scan the index only, reading only part of it). Of course to build the index it will have to read all the data, but I was commenting on the situation where you *can* precompute (which in case of linear search is no gain)
Unreason
+3  A: 

No need to calculate the distance. Just the square of the distance should serve your needs. Should be faster I think. In other words, you can skip the sqrt bit.

Vulcan Eager
+4  A: 

I think this is a tricky question that tests if you don't try to overdo things.

Consider the simplest algorithm people already have given above: keep a table of ten best-so-far candidates and go through all the points one by one. If you find a closer point than any of the ten best-so-far, replace it. What's the complexity? Well, we have to look at each point from the file once, calculate it's distance (or square of the distance actually) and compare with the 10th closest point. If it's better, insert it in the appropriate place in the 10-best-so-far table.

So what's the complexity? We look at each point once, so it's n computations of the distance and n comparisons. If the point is better, we need to insert it in the right position, this requires some more comparisons, but it's a constant factor since the table of best candidates is of a constant size 10.

We end up with an algorithm that runs in linear time, O(n) in the number of points.

But now consider what is the lower bound on such an algorithm? If there is no order in the input data, we have to look at each point to see if it's not one of the closest ones. So as far as I can see, the lower bound is Omega(n) and thus the above algorithm is optimal.

Krystian
Excellent point! Since you have to read the file one by one in order to build any data structure, your lowest *possible* is O(n) just as you say.Only if the question asks about finding the closest 10 points repeatedly does anything else matter!And you explained it well I think.
Zan Lynx
+1  A: 

This question needs further definition.

1) The decision regarding the algorithms that pre-index data varies very much depending if you can hold the whole data in the memory or not.

With kdtrees and octrees you will not have to hold the data in memory and the performance benefits from that fact, not only because the memory footprint is lower but simply because you will not have to read the whole file.

With bruteforce you will have to read the whole file and recompute distances for each new point you will be searching for.

Still, this might not be important to you.

2) Another factor is how many times will you have to search for a point.

As stated by J.F. Sebastian sometimes bruteforce is faster even on large data sets, but take care to take into account that his benchmarks measure reading the whole data set from disk (which is not necessary once kd-tree or octree is built and written somewhere) and that they measure only one search.

Unreason
+1  A: 

For any two points P1 (x1, y1, z1) and P2 (x2, y2, z2), if the distance between the points is d then all of the following must be true:

|x1 - x2| <= d 
|y1 - y2| <= d
|z1 - z2| <= d

Hold the 10 closest as you iterate over your entire set, but also hold the distance to the 10th closest. Save yourself a lot of complexity by using these three conditions before calculating the distance for every point you look at.

Kirk Broadhurst