views:

336

answers:

6

I have a dataset with 100 000 datapoints which I have to plot on a graph. The resulting graph will be about 500px wide, so for every pixel there will be about 200 datapoints, which seems quite unnecessary.

I need to find a way to get rid of the excess datapoints without losing the shape of the graph to speed up the rendering. Currently the rendering of all 100 000 points can take 10+ seconds as I'm also using anti-aliasing and other "effects".

I tried to approach this problem by just taking every 200th datapoint and plotting them, but this results in some of the more significant points missing out (think about spikes in the graph that I want to be able to show). I also thought of splitting the dataset in chunks of 200 datapoints, then taking the maximum value from every chunk but that wont work either.

Is anyone aware of a method that would suit my needs here? The language I'm using is PHP, graph is created by GD and data is coming from MySQL, so optimizations to some of those are welcome.


The data is in this format:

Datetime               Value
2005-01-30 00:00:00    35.30
2005-01-30 01:00:00    35.65
2005-01-30 02:00:00    36.15
2005-01-30 03:00:00    35.95
...

And the resulting graph currently looks like this:

alt text

A: 

I don't know what your code/data source looks like but is it possible to do a distinct on your mysql select statement to reduce the number of data points being brought back to your application?

mynameiscoffey
I updated my answer to include some sample data. Using DISTINCT won't work as it might skip the more "important" points.
Tatu Ulmanen
I see what you are looking at now, for each pixel of width, how many points are contained within that and how do you determine which pixel width contains which items?
mynameiscoffey
+1  A: 

I think that ordinary average from each 200 bunch of points would be just enough.

Or as it was pointed higher, you can take max of this 200 points or any other you want (it depends from info you need from this graph)
Ordinary average is not enough if I have 199 points with value of 15 and 1 point with value of 1200. I want to be able to show that one distinct spike there.
Tatu Ulmanen
+7  A: 

It seems to me that 1 in 200 is pretty serious data loss, and if those 200 values that should be represented with one value on the graph aren't close enough to be meaningfully substituted with an average, you have yourself a problem. If average isn't good enough, you must find a criterium to tell what data is more significant and should be included, and we can't help you with it because we don't know what kind of data it is, its statistical properties, or why any value would be more significant than the other. With those additional info, maybe a more specific answer could be given.

EDIT: After looking at the graph, it seems that you need both minimum and maximum in a given interval, because the dark blue area are values between those two, correct? Maybe you can take 100 values and make a graph from minimum, maximum, and average, so that every point in graph is made with 6 instead of 200 values, or something like that.

Slartibartfast
Yes, I thought also about using both min and max. Perhaps I could get a similar result by using two lines and shading the inbetween, and perhaps a third line to show the average value on top. Good points. Unless someone comes up with a solid equation on how to do this the way I originally intended, I'm going to mark this one as accepted.
Tatu Ulmanen
+2  A: 

Hi

One approach to your problem is max-min decimation; I suggest you Google for a definition and algorithm I don't have either to hand or I would share with you.

Beyond that I think you might use a low-pass (anti-aliasing) filter followed by simple decimation (ie throwing away excess points).

Regards

Mark

High Performance Mark
+2  A: 

Another approach that might work is splitting the graph up into 200 point bins, and discard all but the maximum, minimum, and median points in each interval. Each of the three points in the interval gets plotted at its original location, so the locations of the extreme values won't change. Using the median instead of the mean will probably work better for your data set because the maxima are much more extreme than the minima, which would cause the filtered graph to shift upwards if you used the mean.

Theran
Good points, thanks.
Tatu Ulmanen
+2  A: 

I know this question is quite old but I had a problem almost similar.

To reduce the number of points to display without affecting the shape of the graph, We use the Ramer-Douglas-Peucker algoritm. The difference of shape between the uncompressed graph and the one with this algorithm is unnoticeable.

Francis B.