tags:

views:

138

answers:

2

Hi there. Im trying to build a kd-tree for searching through a set of points, but am getting confused about the use of 'median' in the wikipedia article. For ease of use, the wikipedia article states the pseudo-code of kd-tree construction as:

function kdtree (list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;

        // Sort point list and choose median as pivot element
        select median by axis from pointList;

        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
}

I'm getting confused about the "select median..." line, simply because I'm not quite sure what is the 'right' way to apply a median here.

As far as I know, the median of an odd-sized (sorted) list of numbers is the middle element (aka, for a list of 5 things, element number 3, or index 2 in a standard zero-based array), and the median of an even-sized array is the sum of the two 'middle' elements divided by two (aka, for a list of 6 things, the median is the sum of elements 3 and 4 - or 2 and 3, if zero-indexed - divided by 2.).

However, surely that definition does not work here as we are working with a distinct set of points? How then does one choose the correct median for an even-sized list of numbers, especially for a length 2 list?

I appreciate any and all help, thanks!

-Stephen

A: 

You have to choose an axis with as many element on one side than the other. If the number of points is odd or the points are positioned in such a way that it isn't possible, just choose an axis to give an as even repartition as possible.

AProgrammer
A: 

It appears to me that you understand the meaning of median, but you are confused with something else. What do you mean be distinct set of points?

The code presented by wikipedia is a recursive function. You have a set of points, so you create a root node and choose a median of the set. Then you call the function recursively - for the left subtree you pass in a parameter with all the points smaller than the split-value (the median) of the original list, for the right subtree you pass in the equal and larger ones. Then for each subtree a node is created where the same thing happens. It goes like this:

First step (root node): Original set: 1 2 3 4 5 6 7 8 9 10 Split value (median): 5.5

Second step - left subtree: Set: 1 2 3 4 5 Split value (median): 3

Second step - right subtree: Set: 6 7 8 9 10 Split value (median): 8

Third step - left subtree of left subtree: Set: 1 2 Split value (median): 1.5

Third step - right subtree of left subtree: Set: 3 4 5 Split value (median): 4

etc.

So the median is chosen for each node in the tree based on the set of numbers (points,data) which go into that subtree. Hope this helps.

PeterK
I apologise if my meaning wasn't clear. What I meant by 'distinct' was that if I was trying to form a kd-tree with the points (1,1), (2,2), (3,3), (4,4), (5,5) and (6,6), the median would normally be (3.5, 3.5). However, (3.5,3.5) doesn't exist in the points I'm making a kd-tree out of, so what happens? From your example above, I assume you actually create a new node for the tree, the median node?
Stephen
You are confused in two ways. First, when looking for the median, you must pick one dimension! So the median in your example cannot be (3.5, 3.5), since that is a two dimensional point. Instead the first step is to pick the dimension (say you pick the first). Then you look just at the first dimension of all the points and calculate the median. How to pick the best dimension is another thing. Second thing: no, you dont create a new node. The median is just a value - not the part of the original data. Look at it as a attribute of the splitting node of the tree.
PeterK
Ah, right, okay. Then I'm afraid I am still confused, sorry =[. In the first step, the root node, you declare the median value to be 5.5. But which node becomes the root node then? Your left and right sub-trees contain *all* of the points, so which was chosen to be the root of these sub-trees?
Stephen
I just found a different page - http://groups.csail.mit.edu/graphics/classes/6.838/S98/meetings/m13/kd.html - which seems to imply that in a kd-tree, the nodes of the trees are **not** points that you want to classify; these are apparently stored in leaves. Is this correct, and I've been looking at kd-trees wrong this entire time? Wikipedia seemed to imply that the nodes are the points in the tree!
Stephen
That is correct, the nodes are just something like crossroads. They tell you, that if you go left, you will find points that have value of dimension X lower than Y (median). If you go right, you will find points which have value in dimension X higher or equal to Y (median). So when you arrive at a leaf, the path you traversed limits the points to certain criterion (values of some dimensions are such and such). So the nodes just store the split value (median) and tell you in which dimension the splitting took place. Only the leaves store the actual data.
PeterK
Ah, right, thank you very much. I'd been looking at this all wrong the entire time! I'll go have another crack at it, and thanks very much for putting up with my rather slow uptake!
Stephen
In my example, i was just showing you which points would eventually end up in a leaf if you go the given way. It does not mean that those points are actually stored in the nodes. Depending on how many points you allow to be stored in one leaf the splitting path might be quite long. The parameter of leaf size is often refered to as "bucket size". This is a good analogy, since the points fall through the tree (are filtered by the nodes according to splitting rule - dimension and value (median)). And eventually they fall into a bucket.
PeterK