views:

453

answers:

4

I want to use a data structure for sorting space-time data (x,y,z,time).

Currently a processing algorithm searches a set of 4D (x,y,z,time) points, given a spherical (3d) spacial radius and a linear (1d) time radius, marking for each point, which other points are within those radii. The reason is that after processing, I can ask any 4D point for all of its neighbours in O(1) time.

However in some common configurations of space and time radii, the first run of the algorithm takes about 12 hours. Believe it or not, that's actually fast compared to what exists in our industry. Nevertheless, I want to help speed up the initial runs and so I want to know: Is a kd-tree suitable for 4D space-time data?

Note that I am not looking for implementations of nearest-neighbour search or k-nearest-neighbours search.

More Info:

An example dataset has 450,000 4D points.

Some datasets are time-dense so ordering by time certainly saves processing, but still leads to many distance checks.

Time is represented by Excel-style dates, with typical ranges between 30,000-39,000 (approximate). The space ranges are sometimes higher values, sometimes lower values, but the range between each space co-ordinate is similar to time (e.g. maxX-minX ~ maxT-minT).

Even more info:

I thought I'd add some more slightly irrelevant data in case anybody has dealt with a similar dataset.

Basically I'm working with data that represents space-time events that are recorded and corroborated by multiple sensors. Error is involved, so only events that meet an error threshold are included.

The time span of these datasets ranges between 5-20 years of data.

For the really old data (>8 years old), the events were often very spacially dense for two reasons: 1) there were relatively few sensors available back then, and 2) the sensors were placed close together so that nearby events could be properly corroborated with low error. Further events could be recorded, but they had too high an error

For the newer data (<8 years old), the events are often very time dense, for the inverse reasons: 1) there are usually many sensors available, and 2) the sensors are placed at regular intervals over a larger distance.

As a result, the datasets cannot typically be said to be only time-dense or only spacially dense (except in the case of datasets that contain only new data).

Conclusion

I clearly should be asking more questions on this site.

I will be testing out several solutions over the next while which will include the 4d kd-tree, a 3d kd-tree followed by time distance check (suggested by Drew Hall), and the current algorithm I have. Also, I have been suggested another data structure called TSP (time space partitioning) tree, which uses an octree for space and a bsp on each node for time, so I may test that as well.

Assuming I remember, I'll be sure to post some profiling benchmarks on different time/space radii configurations.

Thanks all

+2  A: 

You haven't really given enough information to answer this.

But sure, in general kd-trees are perfectly suitable for 4 (or 5 or 6 or...) dimensional data --- if the spatial (or in your case space/time-ial) distribution lends itself to kd-tree decomposition. In other words, it depends (sound familiar?).

kd-trees are just one method of spatial decomposition that lend themselves to certain localized searches. As you go to higher dimensions, the curse of dimensionality problem rears it's head, of course, but 4d isn't too bad (you probably want at least a several hundred points though).

In order to know if this will work for you, you have to analyse some other criteria. Is approximate NN search good enough (this can help a lot). Is tree balancing likely to be expensive? etc.

simon
It's a dataset that grows over time. Currently one of the datasets has 450,000 4D points, so I suspect the curse is negligible for these datasets. The kd-tree question relates to someone saying that you can't mix Euclidean and non-Euclidean co-ordinates using Euclidean distance. To be perfectly honest, I don't know whether time is considered Euclidean or not in this case.
Chris Cameron
Presumably you're using the Euclidean metric, not the Minkowski metric (if you have to ask, it's Euclidean). Euclidean means your distance formula looks like d^2 = deltax^2 + deltay^2 + deltaz^2 + deltat^2. A Minkowski metric would change the sign on deltat^2.
kquinn
I'm not using any metric that includes time currently, because I process time separately from space. Time is used to limit the search because the 4d points are in time order. Then I do standard Euclidean distance like you wrote (but without deltat^2).
Chris Cameron
So you're not doing anything special or weird with the time coordinate. That means it should work fine as part of a kd-tree.
kquinn
Can I ask then, what would make time Euclidean or non-Euclidean? That is, why is it that I can use the kd-tree with time and a 4d Euclidean distance metric? I mean no, I'm not doing anything weird like Relativity Theory with time (lol thank god), but why would most literature on kd-trees be so specific about the "same-space" clause? e.g., What fourth dimension would not work with a kd-tree if the first three were Euclidean space?
Chris Cameron
If time is a Euclidean coordinate, then that means it's not actually special. It's just another dimension of your data, orthogonal to the others. That means all the usual rules apply. It's when time *isn't* a true independent coordinate that things start to get messy, and the usual rules and assumptions go out the window (under Minkowski metrics, for instance, the concept of "simultaneity" is totally different!). Making one coordinate play by different rules makes partitioning *hard*.
kquinn
So if I understand you correctly, you're saying if the time co-ordinate were dependent on one or more of the other three spacial co-ordinates, it would not be Euclidean?
Chris Cameron
Yeah, that's a good summary. The important bit is the distance function; when do you consider two points to be "close in time"? If you only have to look at the time coordinate to answer that question, time is a Euclidean coordinate. If, say, you recorded local time for sensors spread across the Earth, that would be non-Euclidean (and non-sane), since noon in New York and noon in Tokyo are *not* the same thing, and you have to consider the cities' locations to extract a useful time coordinate.
kquinn
You should post a summary of this info as an answer so I can select it, it's basically exactly what I was looking for. I was reading about how the concept of "spacetime" is non-Euclidean which worried me but now I realize the non-Euclidean spacetime I read about only matters in relativity theory (e.g.,time affected by velocity) and the case you just mentioned with timezones.
Chris Cameron
there are some good points here, glad my comment could spark it even if I couldn't be here for the discussion....
simon
+2  A: 

If you stored an index to your points sorted in the time dimension, couldn't you first perform an initial pruning in the 1-d time dimension, thus reducing the number of distance calculations? (Or is that an oversimplfication?)

Mitch Wheat
Actually that's how the algorithm works currently. The data is sorted in the time dimension, so we can prune while traversing in time order. That said, the data is sometimes time-dense, leading to more distance calculations than we'd like :/
Chris Cameron
Am I correct to conclude that the hyperbody to be searched is not a hypersphere but a "hypercylinder" (with 3d-spherical caps at each end)?
Svante
I don't know honestly. See my updates for a description of the data, hopefully that helps.
Chris Cameron
+1  A: 

If your data is relatively time-dense (and relatively space-sparse), it might work best to use a 3d kd-tree on the spatial dimensions, then simply reject the points that are outside the time window of interest. That would get around your mixed space/time metric problem, at the expense of a slightly more complex point struct.

Drew Hall
I will likely test this option as well. It would be fantastic to find out that the quick lookup in the kd-tree followed by a 1d time distance check was faster than the current implementation, which is currently the inverse: optimized time lookup followed by costly 3d distance check (though I do also have a box-prune prior to the distance check).
Chris Cameron
+3  A: 

To expand a little bit on my comments to an answer above:

According to the literature, kd-trees require data with Euclidean coordinates. They are probably not strictly necessary, but they're certainly sufficient: guaranteeing that all coordinates are Euclidean ensures that the normal rules of space apply, and makes it possible to easily partition points by their location and build up the tree structure.

Time is a little bit strange. Under the rules of special relativity, you use a Minkowski metric, not a standard Euclidean metric, when you're working with time coordinates. This causes all kinds of problems (most severe among them destroying the meaning of "simultaneity"), and generally makes people afraid of time coordinates. That fear is not well-founded, though, because unless you know you're working on physics, your time coordinate almost certainly actually will be Euclidean in practice.

What does it mean for a coordinate to be Euclidean? It should be independent of all the other coordinates. Saying time is a Euclidean coordinate means that you can answer the question "Are these two points close together in time?" by looking only at their time coordinates, and ignoring any extra information. It's easy to see why not having that property might break a scheme that partitions points by the values of their coordinates; if two points can have radically different time coordinates but still be considered "close in time", then a tree which sorts them by time coordinate is not going to work very well.

An example of a Euclidean time coordinate would be any time specified in a single, consistent time zone (like UTC times). If you have two clocks, one in New York and one in Tokyo, you know that if you have two measurements labelled "12:00 UTC" then they were taken at the same time. But if the measurements are taken in local time, so one says "12:00 New York time" and one is "12:00 Tokyo time", you have to use extra information about the locations and time zones of the cities to figure out how much time elapsed between the two measurements.

So as long as your time coordinate is consistently measured and sane, it will be Euclidean, and that means it will work just fine in a kd-tree or similar data structure.

kquinn