views: 296

answers: 4
+3  A: 

Hi Alix,

I think it's possible. Try using a function that adds up the difference. For instance:

SELECT *, ABS(i - @i) + ABS(j - @j) + ABS(k - @k) + ABS(x - @x) + ABS(y - @y) + ABS(z - @z) FROM table

This will give you your fuzzy-logic difference value: the lower, the closer the match. You should also be able to sort by this expression, i.e., ORDER BY ABS(i - @i)... ASC
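A minimal runnable sketch of the above, assuming a hypothetical `weather` table with `temperature` and `humidity` columns (the names are invented for illustration):

SET @temp = 20, @humidity = 65;

-- Lowest total difference = closest match
SELECT w.*,
       ABS(temperature - @temp) + ABS(humidity - @humidity) AS distance
FROM weather AS w
ORDER BY distance ASC;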

A couple of suggestions: perhaps select the results into a temp table and sort on that.

Alternatively, if you are using SQL Server, look into using the .NET CLR and doing it in a .NET assembly - this will give you more power and options. It will also allow you to attach extra weight to certain fields more easily (if you meant that some fields should have additional weighting).

Hope this helps, Steve

MrCraze
This query just returns NULL for me. He tagged MySQL.
Pentium10
Hi, sorry - yes, he did. So CLR is out. However, MySQL also has the ABS function. Remember that if any of the values are null, the whole expression will be null. So, if you don't want the null values to count: COALESCE(ABS(i-@i), 0) + ... for each expression - then nulls will be treated as 0s. Alternatively, you can put a higher number in there to ensure nulls decrease the relevance - e.g., COALESCE(ABS(i-@i), 10)... depending on what you want to achieve. I assumed there would be no nulls in this data - sorry.
MrCraze
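A quick sketch of that COALESCE idea, reusing the hypothetical `weather` columns from the sketch above:

-- Treat a NULL difference as 0 so missing values don't affect the ranking
SELECT w.*,
       COALESCE(ABS(temperature - @temp), 0)
     + COALESCE(ABS(humidity - @humidity), 0) AS distance
FROM weather AS w
ORDER BY distance ASC;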
What is ABS(i-@i) supposed to do? I don't get that @ operator there.
Pentium10
@Pentium10: @i is a variable you can define.
Alix Axel
@MrCraze: +1. However, using your solution, wouldn't the `moonDistance` values pretty much make all the other values irrelevant? Also, you mentioned weighting the columns - could you give me a suggestion about which factors to use in my weather data example? Thanks.
Alix Axel
Hi Alix, thanks for the positive feedback. As for weighting, maybe you consider temperature the most important factor, so it could be COALESCE(ABS(i-@i), 0) * 10, which would mean differences in temperature give a higher number and therefore less relevance. The opposite could be done for, say, pressure: perhaps divide the difference by 10, or whatever is most fitting. Similar logic could be used to reduce the relevance of `moonDistance`. A weather analyst can best decide that :) Hope this helps!
MrCraze
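One possible reading of that weighting advice, again with the hypothetical `weather` columns (the multipliers and divisors are illustrative, not tuned values):

-- Temperature differences weigh 10x; pressure and moonDistance are damped
SELECT w.*,
       COALESCE(ABS(temperature - @temp), 0) * 10
     + COALESCE(ABS(pressure - @pressure), 0) / 10
     + COALESCE(ABS(moonDistance - @moonDistance), 0) / 1000 AS distance
FROM weather AS w
ORDER BY distance ASC;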
@MrCraze: I was hoping for a solution that could analyze values as a whole since I've no idea what constitutes greater importance.
Alix Axel
Hi Alix, I think Pentium10 is on the right track. I would first select the standard deviation of all the values into variables to speed up the query, then divide the difference between each row's value and the query's value by the standard deviation, adding these together for each element. E.g., SELECT @IDev = STDDEV(i), @JDev = STDDEV(j)..., then SELECT *, (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... AS Rank FROM table ORDER BY (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... This will add up the number of standard deviations each value is from the specified parameter (i.e., today), giving better weighting.
MrCraze
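In MySQL syntax that suggestion might look like the following sketch (MySQL uses SELECT ... INTO rather than the T-SQL-style assignment above; `table` is kept as a placeholder name):

-- Pre-compute the standard deviations once
SELECT STDDEV(i), STDDEV(j) INTO @IDev, @JDev FROM `table`;

-- Total distance measured in standard deviations per column
SELECT t.*,
       (ABS(i - @i) / @IDev) + (ABS(j - @j) / @JDev) AS rank_score
FROM `table` AS t
ORDER BY rank_score ASC;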
+1  A: 

I'm not sure if this is relevant to the question or not, but for some values (i, j, k) more is better, while for the other values (x, y, z) the opposite holds: less is better.

Since you are dealing with dates, you may want to derive new dates based on these criteria.
For example, you can compute a new date for each row: for a value with a higher effect you add a couple of hours/days/weeks/months to your date, and for a value with a lower effect you add just seconds/minutes. Then you sort based on this computed date.
You can also subtract intervals (negative compensation) if a value passes a peak level.
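A rough sketch of that date-shifting idea, with the hypothetical weather columns used earlier (the interval sizes are arbitrary):

-- High-effect factor shifts by hours, low-effect factor by minutes
SELECT w.*,
       DATE_ADD(DATE_ADD(`date`,
                INTERVAL ABS(temperature - @temp) HOUR),
                INTERVAL ABS(humidity - @humidity) MINUTE) AS shifted_date
FROM weather AS w
ORDER BY shifted_date ASC;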

Check out Fuzzy SQL (FSQL)

EDIT 1

Try this: std(i)*abs(i-@i) ...

SELECT 
  (select std(i) from fuzzysearch)*abs(i-7) as v1,
  (select std(j) from fuzzysearch)*abs(j-90) as v2,
  f.* 
FROM fuzzysearch as f 
order by 
  v1 asc, 
  v2 desc
Pentium10
Didn't know MySQL had a STD() built-in function, thanks. Two questions: **1)** wouldn't your query be more favorable (in terms of sorting) to the `i` (`v1`) column? **2)** Why do you sort `v1` ASC and `v2` DESC?
Alix Axel
I must have used the computed value because it differs if, for example, you set 10 instead of `i`. And with `asc` and `desc` I wanted to emphasize that some columns sort in a different order: if the temperature is equal, a higher humidity is better. MySQL also has `VARIANCE()`.
Pentium10
@Pentium10: Thank you, regarding the first question I asked I was not referring to the use of the computed values, but to the order of the `ORDER BY` clause itself. Shouldn't MySQL order by `v1` primarily and only order by `v2` if there are two or more `v1` values with the same "ranking" / order?
Alix Axel
MySQL orders by `v1` primarily and only orders by `v2` if there are two or more `v1` values with the same ranking. That's true. `order by v1 asc, v2 desc`
Pentium10
@Pentium10: In that case this will fail to order by all values as a whole. =\
Alix Axel
Yeah, it will. However, I think it helps you get some answers until you find a working formula that computes a single value from all of them.
Pentium10
+5  A: 

The problem you seem to have is that each column has a different scale, so you can't easily combine them. This problem can be solved using a technique called whitening. It involves calculating the average and standard deviation of each column (you could do this in one SQL statement) and then rescaling each column with them when selecting:

colSortPos = (colValue-colMean) / colStdev

Doing this will center each column around 0, with values within one standard deviation of the mean falling in the range ±1. The trick then is to combine these so that similar dates end up together. The problem here is that this is not a two-dimensional problem, so you need to think multidimensionally. My suggestion is to use the Euclidean distance as your sort order.

SELECT
    date,
    i,
    j,
    k,
    SQRT( POW((i-@iMean)/@iStdDev, 2) + POW((j-@jMean)/@jStdDev, 2) + POW((k-@kMean)/@kStdDev, 2) )
AS
    sort_order
FROM
    table
ORDER BY
    sort_order
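The mean and standard-deviation variables used above need to be filled in first; in MySQL that could be done in one statement, e.g. (sketch, same placeholder table):

SELECT AVG(i), STDDEV(i), AVG(j), STDDEV(j), AVG(k), STDDEV(k)
INTO @iMean, @iStdDev, @jMean, @jStdDev, @kMean, @kStdDev
FROM `table`;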

The only problem with this is that it projects your problem onto a one-dimensional space, which may make you miss some correlations. To work around this, I suggest using a clustering technique like k-means, which is pretty simple to implement and really fast. This will allow you to group your dates into k clusters that display the most similarity [ http://en.wikipedia.org/wiki/K-means_clustering ]. If you have the raw data and want to play around with these (and other) techniques, then I suggest trying the Weka toolkit [ http://www.cs.waikato.ac.nz/ml/weka/ ], which will let you experiment with them.

Neel
I think I made a mistake in my SQL: the Euclidean distance I stated would only give you the rows closest to the mean. If you want to compare with a particular row, then you'd need the distance to that row: SQRT( POW(@iTarget - ((i-@iMean)/@iStdDev), 2) + POW(@jTarget - ((j-@jMean)/@jStdDev), 2) + POW(@kTarget - ((k-@kMean)/@kStdDev), 2) ), where @iTarget/@jTarget/@kTarget hold the whitened values of the target row. Of course, the whitened values can be pre-calculated in the table, and then you could make this calc easier.
Neel
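Spelled out in MySQL, that correction could look like the sketch below. Note that when both sides are whitened, the means cancel, so each term reduces to the raw difference divided by the standard deviation - which matches the intuition in the next comment (@targetDate is a hypothetical parameter):

-- Grab the target row's raw values
SELECT i, j, k INTO @iT, @jT, @kT FROM `table` WHERE date = @targetDate;

-- Euclidean distance to the target row, per column in standard deviations
SELECT date,
       SQRT( POW((i - @iT) / @iStdDev, 2)
           + POW((j - @jT) / @jStdDev, 2)
           + POW((k - @kT) / @kStdDev, 2) ) AS sort_order
FROM `table`
ORDER BY sort_order ASC;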
Hi Neel, really interesting answer - I am wondering why each result is squared and the total then square-rooted? Also, is it necessary to use the mean at all? I tried solving it by using ((@iTarget - i)/@iStdDev) instead, giving the distance between them in number of standard deviations.
MrCraze
You want to do this to get the distance from the point you're interested in to the point you're at. Think about a right-angled triangle: to get the distance from (x1,y1) to (x2,y2) you find the hypotenuse, which is sqrt((x1-x2)^2 + (y1-y2)^2). In 3D (x,y,z) this becomes sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2). Think of your columns as axes on a graph; the points that are most similar would be spatially closer together. The more columns you have, the more dimensions you need to consider. Your intuition to do ((@iTarget - i)/@iStdDev) is correct, as this will measure the distance to some target.
Neel
@Neel: Like MrCraze said, this is a really interesting answer; I'll try it in a couple of hours and post my feedback here. Regarding Weka: it seems like a really nice project (I actually bought the "Data Mining: Practical Machine Learning Tools and Techniques" book by the same authors a while back), but I think I lack the math skills to fully understand how to use it.
Alix Axel
@Alix: Will be interested in the results. I have some ideas on how you could do this in a precomputed way to allow arbitrary sorts, but I'll hold back to see if this works out for you.
Neel
+1  A: 

This is hard (if not impossible) to do in SQL, but there's a multivariate analysis technique called factor analysis. It's a way of creating a "factor": a linear combination of your variables that puts a weight on each input to get a "factor score". Basically, it does a bunch of linear-algebraic manipulations on your data to create a single set of scores that minimizes some objective (like total error).

I did a factor analysis on 3 days of hourly weather data and it looks pretty good. You can see that entries with similar factor scores are generally very close in values across all four measures. I used a principal component extraction with an Equimax rotation:

Wind    Air    Dewpoint    BP         Score
-------------------------------------------
3       12     9           1012.2     -2.72
5       17     11          1011.9     -1.77
5       16     10          1010.8     -1.75
6       15     10          1010.4     -1.68
3       19     13          1012.1     -1.57
5       17     11          1010.8     -1.54
7       17     12          1012.0     -1.53
8       19     12          1012.0     -1.24
5       20     14          1012.0     -1.18
7       17     12          1009.8     -1.06
9       20     13          1012.5     -1.05
8       21     12          1012.1     -1.05
7       19     11          1009.9     -0.98
6       18     13          1009.6     -0.90
0       24     17          1012.1     -0.90
8       20     12          1010.8     -0.88
5       22     15          1012.0     -0.87
10      21     12          1012.0     -0.86
8       21     12          1010.8     -0.78
9       19     12          1010.2     -0.78
8       23     10          1010.7     -0.76
6       23     15          1012.3     -0.75
7       20     12          1009.7     -0.73
7       20     12          1009.6     -0.71
10      22     14          1011.5     -0.45
7       19     15          1009.3     -0.45
6       20     16          1009.7     -0.41
7       20     15          1009.4     -0.37
10      24     12          1010.6     -0.26
5       26     18          1012.5     -0.26
9       23     15          1011.0     -0.22
12      24     16          1012.3     -0.04
6       25     16          1009.5      0.15
5       28     20          1012.5      0.16
0       28     17          1009.0      0.16
8       24     17          1010.2      0.17
3       30     20          1012.5      0.19
12      23     16          1010.6      0.21
8       25     15          1009.5      0.21
13      25     13          1010.5      0.22
13      25     14          1010.6      0.30
5       25     20          1010.1      0.35
6       25     19          1009.9      0.37
9       23     18          1009.5      0.40
13      25     15          1010.3      0.46
5       30     21          1012.4      0.48
7       26     19          1010.0      0.54
12      27     16          1010.9      0.56
8       24     20          1009.7      0.58
7       27     17          1009.2      0.60
7       27     18          1009.6      0.62
7       27     17          1009.0      0.64
8       26     21          1010.6      0.70
9       28     17          1009.8      0.75
8       25     22          1010.0      0.82
8       26     23          1010.8      0.86
8       25     22          1009.8      0.87
8       25     23          1010.1      0.91
9       26     22          1010.5      0.91
8       26     22          1009.8      0.97
14      29     17          1010.8      1.06
12      26     22          1010.8      1.09
10      31     18          1010.3      1.14
14      30     18          1010.7      1.28
17      29     17          1010.9      1.29
10      27     22          1009.4      1.32
12      26     22          1009.7      1.32
8       27     24          1009.3      1.38
14      27     22          1010.2      1.49
12      28     22          1009.8      1.51
16      31     19          1010.7      1.66
14      28     23          1009.6      1.82
Grembo
Seems interesting and may well be worth trying - can you give me some more info on the factor analysis method you mentioned?
Alix Axel