views: 296

answers: 4
+3  A: 

Hi Alix,

I think it's possible. Try using a function that adds up the difference. For instance:

SELECT *, ABS(i - @i) + ABS(j - @j) + ABS(k - @k) + ABS(x - @x) + ABS(y - @y) + ABS(z - @z) FROM table

This will give you your fuzzy-logic difference value: the lower, the closer the match. You should also be able to sort by this expression, i.e., ORDER BY ABS(i - @i)... ASC
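A minimal runnable sketch of the above, assuming a hypothetical `weather` table with `temperature` and `humidity` columns (the names are invented for illustration):

SET @temp = 20, @humidity = 65;

-- Lowest total difference = closest match
SELECT w.*,
       ABS(temperature - @temp) + ABS(humidity - @humidity) AS distance
FROM weather AS w
ORDER BY distance ASC;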

A couple of suggestions: perhaps select the results into a temp table and sort on that.

Alternatively, if you are using SQL Server, look into using the .NET CLR and doing it in a .NET assembly - this will give you more power and options. It will also allow you to attach extra weight to certain fields more easily (if you meant that some fields should have additional weighting).

Hope this helps, Steve

MrCraze
This query just returns NULL for me. He tagged MySQL.
Pentium10
Hi, sorry - yes, he did. So CLR is out. However, MySQL also has the ABS function. Remember that if any of the values are null, the whole expression will be null. So, if you don't want the null values to count: COALESCE(ABS(i-@i), 0) + ... for each expression - then nulls will be treated as 0s. Alternatively, you can put a higher number in there to ensure nulls decrease the relevance - e.g., COALESCE(ABS(i-@i), 10)... depending on what you want to achieve. I assumed there would be no nulls in this data - sorry.
MrCraze
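A quick sketch of that COALESCE idea, reusing the hypothetical `weather` columns from the sketch above:

-- Treat a NULL difference as 0 so missing values don't affect the ranking
SELECT w.*,
       COALESCE(ABS(temperature - @temp), 0)
     + COALESCE(ABS(humidity - @humidity), 0) AS distance
FROM weather AS w
ORDER BY distance ASC;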
What is ABS(i-@i) supposed to do? I don't get that @ operator there.
Pentium10
@Pentium10: @i is a variable you can define.
Alix Axel
@MrCraze: +1. However, using your solution, wouldn't the `moonDistance` values pretty much make all the other values irrelevant? Also, you mentioned weighting the columns - could you give me a suggestion about which factors to use in my weather data example? Thanks.
Alix Axel
Hi Alix, thanks for the positive feedback. As for weighting, maybe you consider temperature the most important factor, so it could be COALESCE(ABS(i-@i), 0) * 10, which would mean differences in temperature give a higher number and therefore less relevance. The opposite could be done for, say, pressure: perhaps divide the difference by 10, or whatever is most fitting. Similar logic could be used to reduce the relevance of `moonDistance`. A weather analyst can best decide that :) Hope this helps!
MrCraze
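One possible reading of that weighting advice, again with the hypothetical `weather` columns (the multipliers and divisors are illustrative, not tuned values):

-- Temperature differences weigh 10x; pressure and moonDistance are damped
SELECT w.*,
       COALESCE(ABS(temperature - @temp), 0) * 10
     + COALESCE(ABS(pressure - @pressure), 0) / 10
     + COALESCE(ABS(moonDistance - @moonDistance), 0) / 1000 AS distance
FROM weather AS w
ORDER BY distance ASC;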
@MrCraze: I was hoping for a solution that could analyze values as a whole since I've no idea what constitutes greater importance.
Alix Axel
Hi Alix, I think Pentium10 is on the right track. I would first select the standard deviation of all the values into variables to speed up the query, then divide the difference between each row's value and the query's value by the standard deviation, adding these together for each element. E.g., SELECT @IDev = STDDEV(i), @JDev = STDDEV(j)..., then SELECT *, (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... AS Rank FROM table ORDER BY (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... This will add up the number of standard deviations each value is from the specified parameter (i.e., today), giving better weighting.
MrCraze
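In MySQL syntax that suggestion might look like the following sketch (MySQL uses SELECT ... INTO rather than the T-SQL-style assignment above; `table` is kept as a placeholder name):

-- Pre-compute the standard deviations once
SELECT STDDEV(i), STDDEV(j) INTO @IDev, @JDev FROM `table`;

-- Total distance measured in standard deviations per column
SELECT t.*,
       (ABS(i - @i) / @IDev) + (ABS(j - @j) / @JDev) AS rank_score
FROM `table` AS t
ORDER BY rank_score ASC;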
+1  A: 

I'm not sure if this is relevant to the question or not, but for some values (i, j, k) more is better, while for the other values (x, y, z) the opposite holds: less is better.

Since you are dealing with dates, you may want to derive new dates based on these criteria.
For example, you can compute a new date for each row: for a value with a higher effect you add a couple of hours/days/weeks/months to your date, and for a value with a lower effect you add just seconds/minutes. Then you sort based on this computed date.
You can also subtract intervals (negative compensation) if a value passes a peak level.
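A rough sketch of that date-shifting idea, with the hypothetical weather columns used earlier (the interval sizes are arbitrary):

-- High-effect factor shifts by hours, low-effect factor by minutes
SELECT w.*,
       DATE_ADD(DATE_ADD(`date`,
                INTERVAL ABS(temperature - @temp) HOUR),
                INTERVAL ABS(humidity - @humidity) MINUTE) AS shifted_date
FROM weather AS w
ORDER BY shifted_date ASC;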

Check out Fuzzy SQL (FSQL)

EDIT 1

Try this: std(i)*abs(i-@i) ...

SELECT 
  (select std(i) from fuzzysearch)*abs(i-7) as v1,
  (select std(j) from fuzzysearch)*abs(j-90) as v2,
  f.* 
FROM fuzzysearch as f 
order by 
  v1 asc, 
  v2 desc
Pentium10
Didn't know MySQL had a STD() built-in function, thanks. Two questions: **1)** wouldn't your query be more favorable (in terms of sorting) to the `i` (`v1`) column? **2)** Why do you sort `v1` ASC and `v2` DESC?
Alix Axel
I must have used the computed value because it differs if, for example, you set 10 instead of `i`. And with `asc` and `desc` I wanted to emphasize that some columns sort in a different order: if the temperature is equal, a higher humidity is better. MySQL also has `VARIANCE()`.
Pentium10
@Pentium10: Thank you, regarding the first question I asked I was not referring to the use of the computed values, but to the order of the `ORDER BY` clause itself. Shouldn't MySQL order by `v1` primarily and only order by `v2` if there are two or more `v1` values with the same "ranking" / order?
Alix Axel
MySQL orders by `v1` primarily and only orders by `v2` if there are two or more `v1` values with the same ranking. That's true. `order by v1 asc, v2 desc`
Pentium10
@Pentium10: In that case this will fail to order by all values as a whole. =\
Alix Axel
Yeah, it will. However, I think it helps you get some answers until you find a working formula that computes a single value from all of them.
Pentium10
+5  A: 

The problem you seem to have is that each column has a different scale, so you can't easily combine them. This problem can be solved using a technique called whitening. It involves calculating the average and standard deviation of each column (you could do this in one SQL statement) and then rescaling each column with them when selecting:

colSortPos = (colValue-colMean) / colStdev

Doing this will center each column around 0, with values within one standard deviation of the mean falling in the range ±1. The trick then is to combine these so that similar dates end up together. The problem here is that this is not a two-dimensional problem, so you need to think multidimensionally. My suggestion is to use the Euclidean distance as your sort order.

SELECT
    date,
    i,
    j,
    k,
    SQRT( POW((i-@iMean)/@iStdDev, 2) + POW((j-@jMean)/@jStdDev, 2) + POW((k-@kMean)/@kStdDev, 2) )
AS
    sort_order
FROM
    table
ORDER BY
    sort_order
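The mean and standard-deviation variables used above need to be filled in first; in MySQL that could be done in one statement, e.g. (sketch, same placeholder table):

SELECT AVG(i), STDDEV(i), AVG(j), STDDEV(j), AVG(k), STDDEV(k)
INTO @iMean, @iStdDev, @jMean, @jStdDev, @kMean, @kStdDev
FROM `table`;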

The only problem with this is that it projects your problem onto a one-dimensional space, which may make you miss some correlations. To work around this, I suggest using a clustering technique like k-means, which is pretty simple to implement and really fast. This will allow you to group your dates into k clusters that display the most similarity [ http://en.wikipedia.org/wiki/K-means_clustering ]. If you have the raw data and want to play around with these (and other) techniques, then I suggest trying the Weka toolkit [ http://www.cs.waikato.ac.nz/ml/weka/ ], which will let you experiment with them.

Neel
I think I made a mistake in my SQL: the Euclidean distance I stated would only give you the rows closest to the mean. If you want to compare with a particular row, then you'd need the distance to that row: SQRT( POW(@iTarget - ((i-@iMean)/@iStdDev), 2) + POW(@jTarget - ((j-@jMean)/@jStdDev), 2) + POW(@kTarget - ((k-@kMean)/@kStdDev), 2) ), where @iTarget/@jTarget/@kTarget hold the whitened values of the target row. Of course, the whitened values can be pre-calculated in the table, and then you could make this calc easier.
Neel
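Spelled out in MySQL, that correction could look like the sketch below. Note that when both sides are whitened, the means cancel, so each term reduces to the raw difference divided by the standard deviation - which matches the intuition in the next comment (@targetDate is a hypothetical parameter):

-- Grab the target row's raw values
SELECT i, j, k INTO @iT, @jT, @kT FROM `table` WHERE date = @targetDate;

-- Euclidean distance to the target row, per column in standard deviations
SELECT date,
       SQRT( POW((i - @iT) / @iStdDev, 2)
           + POW((j - @jT) / @jStdDev, 2)
           + POW((k - @kT) / @kStdDev, 2) ) AS sort_order
FROM `table`
ORDER BY sort_order ASC;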
Hi Neel, really interesting answer - I am wondering why each result is squared and the total then square-rooted? Also, is it necessary to use the mean at all? I tried solving it by using ((@iTarget - i)/@iStdDev) instead, giving the distance between them in number of standard deviations.
MrCraze
You want to do this to get the distance from the point you're interested in to the point you're at. Think about a right-angled triangle: to get the distance from (x1,y1) to (x2,y2) you find the hypotenuse, which is sqrt((x1-x2)^2 + (y1-y2)^2). In 3D (x,y,z) this becomes sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2). Think of your columns as axes on a graph; the points that are most similar would be spatially closer together. The more columns you have, the more dimensions you need to consider. Your intuition to do ((@iTarget - i)/@iStdDev) is correct, as this will measure the distance to some target.
Neel
@Neel: Like MrCraze said, this is a really interesting answer; I'll try it in a couple of hours and post my feedback here. Regarding Weka: it seems like a really nice project (I actually bought the "Data Mining: Practical Machine Learning Tools and Techniques" book by the same authors a while back), but I think I lack the math skills to fully understand how to use it.
Alix Axel
@Alix: Will be interested in the results. I have some ideas on how you could do this in a precomputed way to allow arbitrary sorts, but I'll hold back to see if this works out for you.
Neel
+1  A: 

This is hard (if not impossible) to do in SQL, but there's a multivariate analysis technique called factor analysis. It's a way of creating a "factor": a linear combination of your variables that puts a weight on each input to get a "factor score". Basically, it does a bunch of linear-algebraic manipulations on your data to create a single set of scores that minimizes some objective (like total error).

I did a factor analysis on 3 days of hourly weather data and it looks pretty good. You can see that entries with similar factor scores are generally very close in values across all four measures. I used a principal component extraction with an Equimax rotation:

Wind    Air    Dewpoint    BP         Score
-------------------------------------------
3       12     9           1012.2     -2.72
5       17     11          1011.9     -1.77
5       16     10          1010.8     -1.75
6       15     10          1010.4     -1.68
3       19     13          1012.1     -1.57
5       17     11          1010.8     -1.54
7       17     12          1012.0     -1.53
8       19     12          1012.0     -1.24
5       20     14          1012.0     -1.18
7       17     12          1009.8     -1.06
9       20     13          1012.5     -1.05
8       21     12          1012.1     -1.05
7       19     11          1009.9     -0.98
6       18     13          1009.6     -0.90
0       24     17          1012.1     -0.90
8       20     12          1010.8     -0.88
5       22     15          1012.0     -0.87
10      21     12          1012.0     -0.86
8       21     12          1010.8     -0.78
9       19     12          1010.2     -0.78
8       23     10          1010.7     -0.76
6       23     15          1012.3     -0.75
7       20     12          1009.7     -0.73
7       20     12          1009.6     -0.71
10      22     14          1011.5     -0.45
7       19     15          1009.3     -0.45
6       20     16          1009.7     -0.41
7       20     15          1009.4     -0.37
10      24     12          1010.6     -0.26
5       26     18          1012.5     -0.26
9       23     15          1011.0     -0.22
12      24     16          1012.3     -0.04
6       25     16          1009.5      0.15
5       28     20          1012.5      0.16
0       28     17          1009.0      0.16
8       24     17          1010.2      0.17
3       30     20          1012.5      0.19
12      23     16          1010.6      0.21
8       25     15          1009.5      0.21
13      25     13          1010.5      0.22
13      25     14          1010.6      0.30
5       25     20          1010.1      0.35
6       25     19          1009.9      0.37
9       23     18          1009.5      0.40
13      25     15          1010.3      0.46
5       30     21          1012.4      0.48
7       26     19          1010.0      0.54
12      27     16          1010.9      0.56
8       24     20          1009.7      0.58
7       27     17          1009.2      0.60
7       27     18          1009.6      0.62
7       27     17          1009.0      0.64
8       26     21          1010.6      0.70
9       28     17          1009.8      0.75
8       25     22          1010.0      0.82
8       26     23          1010.8      0.86
8       25     22          1009.8      0.87
8       25     23          1010.1      0.91
9       26     22          1010.5      0.91
8       26     22          1009.8      0.97
14      29     17          1010.8      1.06
12      26     22          1010.8      1.09
10      31     18          1010.3      1.14
14      30     18          1010.7      1.28
17      29     17          1010.9      1.29
10      27     22          1009.4      1.32
12      26     22          1009.7      1.32
8       27     24          1009.3      1.38
14      27     22          1010.2      1.49
12      28     22          1009.8      1.51
16      31     19          1010.7      1.66
14      28     23          1009.6      1.82
Grembo
Seems interesting and may well be worth trying - can you give me some more info on the factor analysis method you mentioned?
Alix Axel