views:

111

answers:

2

Hi,

I'm working on a cluster analysis program that takes a set of points S as an input and labels each point with that index of the cluster it belong to. I've implemented the DBScan and OPTICS algorithms and they both work as expected. However, the results of those algorithms can be very different depending on the initial values of MinPts and Epsilon. I've searched all over the net and read lots of papers about data mining and cluster analysis and yet I can't seem to find a way of analyzing the data without needing MinPts and Epsilon to determine if a point is in such a cluster. I'm guessing density based cluster analysis is not the way to go in my case.

Does anyone have an idea or know about an algorithm I could use that wont require that kind of configuration ? Or simply point me in the right direction. Any help is welcome.

Thanks!

It's a school project I'm trying to finish, in which I have a set of 2D coords representing points on a plane, and I have to determine what cluster each point belongs to. Now I've done that using OPTICS and it work fine but I need to tweak the Eps value so that my output matches the example outputs I'm given. But since I have no description of what a cluster is in the subject, or what it's characteristics are, there is no way I can base myself solely on the distance between the points, or the density of points in a given region. Also, I do not know the number of clusters in advance, hence my use of the OPTICS algorithm. So in my opinion, either I'm doing it very wrong, either there is a crucial piece of info missing in the subject. And also, I'm not looking for anyone to do my homework or give me any source code, just some ideas or guidance, since I'm pretty much lost as how get the exact results given in the data set examples (I'm also not allowed to get any wrong values, if I do they consider the project is a failure, so algorithms with error margins can't be used).

Thanks again, and sorry for the long post.

+1  A: 

In general a set of points can be assigned to clusters in more than one way (e.g. they can all be assigned to one big cluster, or divided into two or three), so you have to have some parameters.

Why do you object to MinPts and Epsilon? If you don't like what happens when you change them, don't change them. Seriously.

EDIT:
What a bizarre assignment! Your clustering must match theirs perfectly, with no other clues? I will assume that they are neither idiots nor sadists and make the following guess: in the examples, there is a "natural" clustering that is obvious to the eye. Am I right? If so then there is a way we can set the parameters programmatically, as a function of distances in the point set. How many examples are there, and is it possible to post one?

EDIT:
Hah! I knew it! Here is a rule that will correctly divide this case into clusters: find the biggest distance from any point to its nearest neighbor, and if any two points are less than twice that distance apart, they belong to the same cluster. I'll bet it will work on the other cases too.

Beta
I don't object to using MinPts and Eps. In the current version of my program I have emplemented the OPTICS algorithm, but it seems to me that using distance based analysis is pretty vague, you could use any kind of Eps value and have very different results, and I'm just lost as to what other "type" of algorithm I should/could use.
__dominic
Do you think there would be a way of calculating optimal Eps and MinPts values based on the input data set ?
__dominic
@__dominic: "optimal" in what sense? You have to decide what it is that you're trying to achieve, what qualities you want in the clusters, what problems you're trying to avoid. Can you give us more information about how this tool will be used?
Beta
I've uploaded an example here: http://bit.ly/aRSsh8There are 5 examples in all, and yes, as you'll see the clustering is visible to the eye, but the algorithm can't guess that on it's own. I have tried something based on what you said, I've plotted out my points before running the algo and then used the actual mapped coords (from the coords in the file to the pixel position the point is plotted at) and used a fixed value for Eps, and while I get the correct results, there are times I have a few wrong values, and I need to mactch all values exactly.Thanks for taking some time to help me out.
__dominic
I'm not sure I know what you mean when you say: "find the biggest distance from any point to its nearest neighbor", would that not just be the smallest distance from a given point to any other point ? Or do you mean the biggest distance from any point to its nearest neighbor, where a neighbor is a point at <= Eps distance from that given point ? In the case I still have an arbitrary Eps value.
__dominic
@__dominic: Each point has a distance to its nearest neighbor; choose the largest of these distances.
Beta
Thanks alot that takes care of the few errors I had when I tried using the pixel positions from the mapped coords. I need to work on performance now as finding the greatest min dist is pretty slow, and I may be using the wrong algo now that I have a solution to this weird assignment (because it is a little weird that we have no info about what hey consider a cluster to be). Anyway, thank you for your help, much appreciated!
__dominic
__dominic: yes, it is O(n^2). Note that you don't have to call `sqrt` every time, you can use distance-squared. And there are ways to speed things up even more if you need them.
Beta
A: 

You could try looking into the many other cluster algorithms out there. You have probabilistic clustering (EM), partitional clustering (KMeans), hierarchical clustering, and many others... Of course each requires a different kind of configuration

Also make sure to try Weka, an open source tool containing lots and lots of machine learning algorithms (classification, clustering, pre-processing, ...). I believe it has an implementation (Java) for all those algorithms mentioned.


Edit: The question of determining which clustering is best is very domain-dependent. And it all comes down to how the clusters are put to use in the context of your application that determines how useful they are (Besides there could be more than one natural clustering of your data).

Amro