Interpolating Large Datasets

I have a large dataset of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.

I have an application that wants to be able to graph this data, or maybe a subset of it. For obvious reasons I do not want to plot 0.5 million points on my graph.

What I need is a smaller dataset (100 points or so) which represents the given data as accurately as possible. Does anyone know of any interesting and performant ways this can be achieved?

Cheers, Karl

+2  A: 

One thought is to use the DBMS to compress the data for you with an appropriate query. Something along the lines of having it take a median for each time range; a pseudo-query:

SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates 
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)

Where truncate_to_hour is something appropriate to your DBMS. A similar approach would use some kind of function to segment the time into unique blocks (such as rounding to the nearest 5-minute interval), or another aggregate function appropriate for your data in place of median. Depending on the complexity of the time-segmenting expression and how your DBMS optimizes it, it may be more efficient to run the query against a temporary table that already holds the segmented time value.
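
If your DBMS lacks a median aggregate, the same bucketing can be done in application code. Below is a minimal Java sketch of that idea; the class and method names are made up for illustration. It truncates millisecond timestamps to 5-minute blocks and takes the median of each block, mirroring the pseudo-query above:

import java.util.*;

// Hypothetical sketch: bucket (timestampMillis -> rate) samples into 5-minute
// intervals and take the median of each bucket.
class RateBucketer {
    private static final long BUCKET_MS = 5 * 60 * 1000L;

    // Input: sample timestamps (ms since epoch) mapped to rates.
    // Output: bucket start time -> median rate, ordered by time.
    static SortedMap<Long, Double> medianPerBucket(Map<Long, Double> samples) {
        SortedMap<Long, List<Double>> buckets = new TreeMap<Long, List<Double>>();
        for (Map.Entry<Long, Double> e : samples.entrySet()) {
            long bucketStart = (e.getKey() / BUCKET_MS) * BUCKET_MS; // truncate to 5-minute block
            List<Double> list = buckets.get(bucketStart);
            if (list == null) {
                list = new ArrayList<Double>();
                buckets.put(bucketStart, list);
            }
            list.add(e.getValue());
        }

        SortedMap<Long, Double> medians = new TreeMap<Long, Double>();
        for (Map.Entry<Long, List<Double>> e : buckets.entrySet()) {
            List<Double> rates = e.getValue();
            Collections.sort(rates);
            int n = rates.size();
            double median = (n % 2 == 1)
                    ? rates.get(n / 2)
                    : (rates.get(n / 2 - 1) + rates.get(n / 2)) / 2.0;
            medians.put(e.getKey(), median);
        }
        return medians;
    }
}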

M. Jessup
+1  A: 

Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.

I use this at work for things like error graphs; I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I keep 1-hour resolution for a few days, then 1-day resolution for a few months.

Maelstrom
+1  A: 

If you wanted to write your own, one obvious solution would be to break your record set into chunks of a fixed number of points, each reduced to a single value (mean, median, ... pick one). This is probably the fastest approach, and it shows overall trends.
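
As a rough sketch of that idea, assuming the rates are already loaded in time order into a double[] (class and method names are invented for illustration):

// Illustrative sketch: reduce a time-ordered series to roughly `targetPoints`
// values by averaging fixed-size chunks.
class ChunkedDownsampler {
    static double[] chunkedMeans(double[] rates, int targetPoints) {
        int chunkSize = Math.max(1, rates.length / targetPoints);
        int chunks = (rates.length + chunkSize - 1) / chunkSize; // ceiling division
        double[] means = new double[chunks];
        for (int c = 0; c < chunks; c++) {
            int start = c * chunkSize;
            int end = Math.min(start + chunkSize, rates.length);
            double sum = 0.0;
            for (int i = start; i < end; i++) {
                sum += rates[i];
            }
            means[c] = sum / (end - start);
        }
        return means;
    }
}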

But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
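
A cheaper relative of that idea, keeping the highest and lowest tick in each fixed window rather than detecting true inflection points, preserves most of the spikes. A rough sketch (not this answer's exact method; names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Rough sketch: keep the minimum and maximum tick of each fixed window so
// spikes survive the downsampling. Each returned entry is {time, rate}.
class WindowExtremes {
    static List<double[]> extremes(double[] times, double[] rates, int windowSize) {
        List<double[]> points = new ArrayList<double[]>();
        for (int start = 0; start < rates.length; start += windowSize) {
            int end = Math.min(start + windowSize, rates.length);
            int minIdx = start, maxIdx = start;
            for (int i = start + 1; i < end; i++) {
                if (rates[i] < rates[minIdx]) minIdx = i;
                if (rates[i] > rates[maxIdx]) maxIdx = i;
            }
            // Emit the extremes in chronological order so the plot stays ordered in time.
            int first = Math.min(minIdx, maxIdx);
            int second = Math.max(minIdx, maxIdx);
            points.add(new double[] { times[first], rates[first] });
            if (second != first) {
                points.add(new double[] { times[second], rates[second] });
            }
        }
        return points;
    }
}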

CPerkins
+3  A: 

There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.

Summarizing a trend over time

Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):

> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a loess-smoothed line to the points
> lines(lowess(cars), col = 2)
# fit a finer-grained loess-smoothed line to the points
> lines(lowess(cars, f=.2), col = 3)

The parameter f controls how tightly the regression fits your data. Choose it thoughtfully: you want something that fits your data accurately without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.

It's also straightforward to access the results of the smoothing. Here's how to do that:

> data = lowess( cars$speed, cars$dist )
> data
$x
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25

$y
 [1]  4.965459  4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698

The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.

James Thompson
A: 

How about making an enumeration/iterator wrapper? I'm not that familiar with Java, but it might look something like this:

import java.util.Enumeration;

// Wraps another Enumeration<Double> and collapses each frame of `frameSize`
// values into one value. Note: despite the name, this returns the arithmetic
// mean of each frame, not the median.
class MedianEnumeration implements Enumeration<Double>
{
    private Enumeration<Double> frameEnum;
    private int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        frameEnum = e;
        frameSize = len;
    }

    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    public Double nextElement() {
        double sum = frameEnum.nextElement();

        // Consume up to frameSize values (fewer at the end of the source)
        // and return their average.
        int i;
        for(i=1; (i < frameSize) && (frameEnum.hasMoreElements()); ++i) {
            sum += frameEnum.nextElement();
        }

        return (sum / i);
    }
}
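
A quick usage sketch (the sample rates below are made up): wrap any Enumeration<Double>, for example one obtained from a List via Collections.enumeration, and read the reduced values back out:

import java.util.*;

class FrameAverageDemo {
    public static void main(String[] args) {
        // Fabricated sample rates, purely for illustration.
        List<Double> rates = Arrays.asList(1.61, 1.62, 1.60, 1.63, 1.65, 1.64, 1.66, 1.62);

        // Collapse the rates in frames of 4, yielding 2 output values here.
        Enumeration<Double> reduced =
                new MedianEnumeration(Collections.enumeration(rates), 4);
        while (reduced.hasMoreElements()) {
            System.out.println(reduced.nextElement());
        }
    }
}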
ony
+1  A: 

The naive approach is simply to calculate an average per time interval corresponding to a pixel.

http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg

This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially making each plotted point taller than a single pixel). I could not locate an example, but I know that Gnuplot can do this (though it is not written in Java).
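
A minimal sketch of that idea (names invented for illustration): compute the mean and standard deviation for each fixed-size bucket, one bucket per pixel column, then plot the mean plus/minus one standard deviation:

// Illustrative sketch: per-bucket mean and standard deviation, one bucket per
// pixel column. Returns rows of {mean, stdDev}; plotting mean +/- stdDev shows
// the spread described above.
class BucketStats {
    static double[][] meanAndStdDevPerBucket(double[] rates, int buckets) {
        int bucketSize = Math.max(1, rates.length / buckets);
        int count = (rates.length + bucketSize - 1) / bucketSize;
        double[][] stats = new double[count][2];
        for (int b = 0; b < count; b++) {
            int start = b * bucketSize;
            int end = Math.min(start + bucketSize, rates.length);
            int n = end - start;
            double sum = 0.0;
            for (int i = start; i < end; i++) sum += rates[i];
            double mean = sum / n;
            double sqDiff = 0.0;
            for (int i = start; i < end; i++) sqDiff += (rates[i] - mean) * (rates[i] - mean);
            stats[b][0] = mean;
            stats[b][1] = Math.sqrt(sqDiff / n); // population standard deviation
        }
        return stats;
    }
}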

Thorbjørn Ravn Andersen
The _really_ naive solution would be to simply take every N-th value. I expect that taking, e.g., every 100th value in a 100k dataset would still provide a very good picture of the measured value's history, and no other method could touch it in terms of performance.
Tomislav Nakic-Alfirevic
True. It appears that speed is more important than pixel accuracy.
Thorbjørn Ravn Andersen
This answer seems oddly... familiar. ;]
CPerkins
And you are implying?...
Thorbjørn Ravn Andersen
Implying nothing. Read your answer, read mine. We seem to have thought along similar lines.
CPerkins