What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?

One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.

The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.

Edit: Ideally the result of the algorithm would be comparable across different data sets. For example, using the mean square error as suggested by Nils produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as Wind Speed or Precipitation because the scale of the data is different. Some of the non-weather data being used is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated using temperature.

A: 

First of all, ask yourself if these are sets, or ordered collections.

I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
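
For example, a minimal sketch of that idea in Python (the names history, forecast, and the tolerance value are placeholders, not something given in the question):

def count_matching_slots(candidate, target, tolerance):
    # Count how many of the hourly slots agree to within the tolerance.
    return sum(1 for c, t in zip(candidate, target) if abs(c - t) <= tolerance)

# Rank historical days by how many slots match today's forecast;
# the tolerance (2.0 here) is just an assumed placeholder.
ranked = sorted(history, key=lambda day: count_matching_slots(day, forecast, 2.0), reverse=True)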

Marcin
In this case, all my sets are ordered and consist of 24 hourly values each. The problem with selecting a tolerance level is that I don't know the scale of the values before the algorithm is run.
Adam Hughes
Wait...you know that these are weather data, yet you don't know the scale? If that is really the case, why not select the tolerance dynamically by analysing the standard deviation of the data?
Marcin
+4  A: 

I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that do make sense.

Since you want to compare measurements over time, you can just leave out missing values from the calculation.

For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
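
A sketch of the mean square error over paired hourly values, skipping slots where either value is missing (Python; the function and variable names are mine, just for illustration):

def mean_square_error(a, b):
    # Pair up the two series and skip slots where either value is missing.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float('inf')  # nothing to compare
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)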

Nils Pipenbrinck
+1  A: 

In finance they use Beta to measure the correlation of two series of numbers. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the two series can have different scales.

In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).

Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
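
A sketch of that calculation with NumPy (assuming the two inputs are aligned arrays of percentage returns):

import numpy as np

def beta(series, benchmark):
    # Beta = Covariance(series, benchmark) / Variance(benchmark).
    series = np.asarray(series, dtype=float)
    benchmark = np.asarray(benchmark, dtype=float)
    covariance = np.cov(series, benchmark)[0, 1]   # sample covariance (ddof=1)
    return covariance / np.var(benchmark, ddof=1)  # sample variance, to match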

RossFabricant
+1  A: 

Look at statistical sites. I think you are looking for correlation.

leppie
Correlation is one of the first things I checked out, but it only measures the similarity of the curves, not the actual values. If the temperatures rose and fell by the same amount each hour but were 100 degrees different, the correlation would still be 1.
Adam Hughes
A: 

I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:

function calculate_score(historical_set, forecast_set)
{
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}

I then sort all the results from high to low.

Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.

Adam Hughes
+1  A: 

As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:

  • Temp: -50 to 100F (I'm in Minnesota, USA)
  • Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
  • Precip: 0 to 100

Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
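
A sketch of that normalization step (Python; the ranges and weights are the ones assumed above):

def normalize(observation):
    # Scale wind and precip so all three features span roughly 150 "units",
    # matching the -50..100F temperature range above.
    temp, wind, precip = observation
    return (temp, wind * 1.25, precip * 1.5)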

Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:

for each historicalpoint

    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) + 
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))

    if our match collection is not yet full, or distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        if the match collection is now over its size limit, remove the match with the largest distance from it

next

This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
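
For reference, a runnable version of the brute-force search sketched above might look like this in Python (the point layout and the value of k are assumptions):

import heapq
import math

def closest_matches(current, history, k=10):
    # Keep the k historical points with the smallest Euclidean distance
    # to the current (normalized) observation; each point is assumed to
    # be a (temp, wind, precip) tuple.
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return heapq.nsmallest(k, history, key=lambda point: distance(current, point))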

Cheers.

Corbin March
A: 

A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F with 2000 km/hr winds, but it seems pretty unlikely.

I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.

Normalization in any style should make all variables comparable.

As an example, let's say it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.

This focus on the empirical distribution is easy to understand as well, and could be more robust than estimation that assumes normality (like mean square error).
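
A sketch of the quantile transform against the historical record (Python with NumPy; variable names are illustrative):

import numpy as np

def to_quantile(value, historical_values):
    # Empirical quantile: the fraction of historical observations
    # at or below this value.
    historical = np.sort(np.asarray(historical_values, dtype=float))
    return np.searchsorted(historical, value, side='right') / len(historical)

def quantile_distance(day_a, day_b, history_by_feature):
    # Compare two days on a common 0..1 scale per feature, then combine
    # with a sum of absolute quantile differences.
    return sum(abs(to_quantile(a, hist) - to_quantile(b, hist))
               for a, b, hist in zip(day_a, day_b, history_by_feature))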

Gregg Lind
Weather is just one example of the type of data this algorithm could be used for. I don't know what type of data will be fed into it. It could be hot dog sales at a baseball stadium, the number of cars traveling on a highway, electricity load, etc.
Adam Hughes
+1  A: 

Talk to a statistician.

Seriously.

They do this type of thing for a living.

You write that the "similarity of two sets is a bit subjective", but it's not subjective at all; it's a matter of determining the appropriate criteria for similarity for your problem domain.

This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.

Michael Dorfman
+1  A: 

Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
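
For reference, the coefficient itself is also straightforward to compute directly; a sketch in Python (not the linked SQL):

import math

def pearson(x, y):
    # Pearson correlation: covariance of x and y divided by the
    # product of their standard deviations.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)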

A: 

Are the two data sets ordered, or not?

If ordered, are the indices the same? Equally spaced?

If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP

Otherwise, you can do two regressions of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
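
A sketch of the common-index case with SciPy (assuming SciPy is available; the variable names are illustrative):

from scipy import stats

def compare_by_regression(first_set, second_set):
    # Regress one data set against the other; for similar sets the
    # slope should be near 1 and the intercept near 0.
    result = stats.linregress(first_set, second_set)
    return result.slope, result.intercept, result.rvalue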

====

If unordered, I think you want to look at the cumulative distribution functions http://en.wikipedia.org/wiki/Cumulative_distribution_function

One relevant test is Kolmogorov-Smirnov: http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

You could also look at

Student's t-test, http://en.wikipedia.org/wiki/Student%27s_t-test

or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

to test equality of means between the two samples.

And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
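
With SciPy, each of these tests is roughly a one-liner; a sketch (the choice of functions is mine, and the input names are placeholders):

from scipy import stats

def compare_samples(target_set, candidate_set):
    # Each call returns (statistic, p-value).
    ks = stats.ks_2samp(target_set, candidate_set)    # compare distributions
    tt = stats.ttest_ind(target_set, candidate_set)   # compare means
    wx = stats.wilcoxon(target_set, candidate_set)    # paired ranks (equal-length samples)
    lv = stats.levene(target_set, candidate_set)      # compare variances
    return ks, tt, wx, lv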

Note: it is possible for dissimilar sets of data to have the same mean and variance. Depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments as well.

Gerry