views:

115

answers:

5

I'm looking for something that I guess is rather sophisticated and might not exist publicly, but hopefully it does.

I basically have a database with lots of items which all have values (y) that correspond to other values (x). Eg. one of these items might look like:

x | 1 | 2 | 3 | 4 | 5
y | 12 | 14 | 16 | 8 | 6

This is just a a random example. Now, there are thousands of these items all with their own set of x and y values. The range between one x and the x after that one is not fixed and may differ for every item.

What I'm looking for is a library where I can plugin all these sets of Xs and Ys and tell it to return things like the most common item (sets of x and y that follow a compareable curve / progression), and the ability to check whether a certain set is atleast x% compareable with another set.

With compareable I mean the slope of the curve if you would draw a graph of the data. So, not actaully the static values but rather the detection of events, such as a high increase followed by a slow decrease, etc.

Due to my low amount of experience in mathematics I'm not quite sure what I'm looking for is called, and thus have trouble explaining what I need. Hopefully I gave enough pointers for someone to point me into the right direction.

I'm mostly interested in a library for javascript, but if there is no such thing any library would help, maybe I can try to port what I need.

+2  A: 

What you're wanting to do is ANOVA, or ANalysis Of VAriance. If you run the numbers through an ANOVA test, it'll give you information about the dataset that will help you compare one to another. I was unable to locate a Javascript library that would perform ANOVA, but there are plenty of programs that are capable of it. Excel can perform ANOVA from a plugin. R is a stats package that is free and can also perform ANOVA.

Hope this helps.

DBA_Alex
Rather disappointing there are no libraries for ANOVA for Javascript. I can't find good ones for Java either.
Tom
Please, do NOT use the one of Excel. Excel is far from optimal.
Joris Meys
None of these programs are optimal. I'm in need of a programming library.
Tom
+1  A: 

Something simple is (assuming all the graphs have 5 points, and x = 1,2,3,4,5 always)

  • Take u1 = the first point of y, ie. y1
  • Take u2 = y2 - y1
  • ...
  • Take u5 = y5 - y4

Now consider the vector u as a point in 5-dimensional space. You can use simple clustering algorithms, like k-means.

EDIT: You should not aim for something too complicated as long as you go with javascript. If you want to go with Java, I can suggest something based on PCA (requiring the use of singular value decomposition, which is too complicated to be implemented efficiently in JS).

Basically, it goes like this: Take as previously a (possibly large) linear representation of data, perhaps differences of components of x, of y, absolute values. For instance you could take u = (x1, x2 - x1, ..., x5 - x4, y1, y2 - y1, ..., y5 - y4)

You compute the vector u for each sample. Call ui the vector u for the ith sample. Now, form the matrix

M_{ij} = dot product of ui and uj

and compute its SVD. Now, the N most significant singular values (ie. those above some "similarity threshold") give you N clusters.

The corresponding columns of the matrix U in the SVD give you an orthonormal family B_k, k = 1..N. The squared ith component of B_k gives you the probability that the ith sample belongs to cluster K.

Alexandre C.
1) I need to detect simularities between parts of each input too. Eg. if from x1 to x2 is 75% similar to x1 to x2 in a different set, this should be returned. 2) the sets will have different amounts of x and the ranges between xi and xi+1 will be different too (x is a timestamp). This basically is very sophisticated, which is why I'm looking for an existing library.
Tom
@Tom: Add variables. This is still flexible.
Alexandre C.
+2  A: 

What you're looking for is an implementation of a Markov clustering. It is often used for finding groups of similar sequences. Porting it to Javascript, well... If you're really serious about this analysis, you drop Javascript as soon as possible and move on to R. Javascript is not meant to do this kind of calculations, and it is far too slow for it. R is a statistical package with much implemented. It is also designed specifically for very speedy matrix calculations, and most of the language is vectorized (meaning you don't need for-loops to apply a function over a vector of values, it happens automatically)

For the markov clustering, check http://www.micans.org/mcl/ An example of an implementation : http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi

Now you also need to define a "distance" between your sets. As you are interested in the events and not the values, you could give every item an extra attribute being a vector with the differences y[i] - y[i-1] (in R : diff(y) ). The distance between two items can then be calculated as the sum of squared differences between y1[i] and y2[i].

This allows you to construct a distance matrix of your items, and on that one you can call the mcl algorithm. Unless you work on linux, you'll have to port that one.

Joris Meys
Best answer, thanks. (Though not what I was hoping for, this promises a lot of work).
Tom
@Tom : I'm sorry for you, it definitely is a lot of work. you might want to check on stats.stackexchange.com as well for further guidance though. I merely gave you an outline of a possible analysis. Depending on the structure and peculiarities of your data, you might want to check other options. I learnt the hard way that it's better to spend some time in looking at all options than to head off one way, only to find out that it's actually not for your kind of data... Good luck with it.
Joris Meys
+3  A: 

About Markov Cluster(ing) again, of which I happen to be the author, and your application. You mention you are interested in trend similarity between objects. This is typically computed using Pearson correlation. If you use the mcl implementation from http://micans.org/mcl/, you'll also obtain the program 'mcxarray'. This can be used to compute pearson correlations between e.g. rows in a table. It might be useful to you. It is able to handle missing data - in a simplistic approach, it just computes correlations on those indices for which values are available for both. If you have further questions I am happy to answer them -- with the caveat that I usually like to cc replies to the mcl mailing list so that they are archived and available for future reference.

Stijn van Dongen
+1  A: 

If it is ok to use java you really should have a look at Weka. It is possible to access all features via java code. Maybe you find a markov clustering, but if not, they hava a lot other clustering algorithem and its really easy to use.

InsertNickHere