I have some kind of object model and I need to filter and sort its nodes by some kind of property. What kinds of automated systems exist to generate and select properties of the object model that correlate with what I want? (I'm intentionally being abstract and non-specific.)

I'm thinking of a system that works something like spam filters or supervised classification systems, in that, given an example data set, it identifies rules that find nodes of interest. However, I'm looking for a more general system, in that it shouldn't require any design-time information about the object model. It should work equally well as a spam filter on e-mail, a bug finder on a code base, an interest filter on a newsgroup, or a bot-account finder on a social networking site. As long as it can explore the object model via reflection and be given a set of "interesting" nodes, it should be able to find rules that will find more nodes like them.
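
For concreteness, here is a minimal sketch (Python, purely illustrative; the depth limit and the choice to skip callables are arbitrary assumptions) of the reflection-driven feature extraction I have in mind:

    def extract_features(node, prefix="", max_depth=2):
        """Flatten an arbitrary object's reflected state into a feature dict,
        with no design-time knowledge of the object model."""
        features = {}
        if max_depth < 0:
            return features
        for name in dir(node):
            if name.startswith("_"):
                continue  # skip private and dunder attributes
            try:
                value = getattr(node, name)
            except Exception:
                continue  # some properties raise on access
            if callable(value):
                continue  # only inspect state, not behavior
            key = prefix + name
            if isinstance(value, (bool, int, float, str)):
                features[key] = value  # primitive leaf: use directly
            else:
                # recurse into nested objects, bounding the depth
                features.update(extract_features(value, key + ".", max_depth - 1))
        return features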

A: 

It is highly unlikely that there is a single automated classification system which could do all that you are asking. Additionally, I believe the bug finder application falls outside the scope of such a system since the methods which are being successfully used in that domain largely revolve around syntactic analysis, data flow analysis, and other algorithmic methods highly tailored to issues surrounding software errors. Although machine learning research is being done there, the classification systems in this domain are mostly being used to augment rather than replace analytical methods (so far as I know).

For most non-trivial classification problems, careful selection and refinement of the problem representation is typically required in order to get useful and effective results via machine learning. Simply using the existing "raw" data object model without some sort of tailored transformation of the state space tends to lead to incomplete coverage of the distribution of input data values, poor generalization of the learned classifiers, or both. Additionally, other parameters specific to the machine learning method being used may require trial-and-error tweaking to get decent results for a given problem. Not all methods have such parameters, but many do: neural networks, genetic algorithms, Bayesian inference methods, etc.
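
As a toy illustration of that representation problem (assuming scikit-learn; the spam-style records and the tailored transform are invented for the example), the same two records can be encoded in a way that either hides or exposes the structure a classifier needs:

    from sklearn.feature_extraction import DictVectorizer

    raw = [{"sender": "admin1@spam.biz"}, {"sender": "alice@example.org"}]

    # Raw representation: each full address becomes its own one-hot column,
    # so nothing learned about "admin1@spam.biz" transfers to "admin2@spam.biz".
    X_raw = DictVectorizer().fit_transform(raw)

    def tailored(record):
        """Hand-crafted transform: expose structure the raw encoding hides."""
        user, _, domain = record["sender"].partition("@")
        return {"domain": domain, "user_has_digit": any(c.isdigit() for c in user)}

    # Transformed representation: a shared domain feature and a digits-in-username
    # flag give the learner something that actually generalizes.
    X_better = DictVectorizer().fit_transform([tailored(r) for r in raw])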

What you are asking for is a nearly universal machine learning method, which is not something that currently exists. The most viable alternatives that I can see would be to (1) narrow the problem to a subset for which this level of capability/sophistication is not required, or (2) create a system which uses not just one classification technique but rather has a toolbox of different methods that it automatically tests against a given problem and then uses the one which generates the best classification results under a supervised learning regime. The latter would still be quite a challenge to pull off effectively, though, and it does not eliminate the problem of how to represent/transform the state space for the data model.
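
In sketch form, option (2) might look something like this (again assuming scikit-learn; the particular classifiers in the toolbox and the use of 5-fold cross-validation are arbitrary choices):

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def pick_best_classifier(X, y):
        """Try each method in the toolbox; keep the best cross-validated one."""
        toolbox = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
        scored = [(cross_val_score(clf, X, y, cv=5).mean(), clf) for clf in toolbox]
        best_score, best_clf = max(scored, key=lambda pair: pair[0])
        return best_clf.fit(X, y)  # refit the winner on all the data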

Joel Hoff
The last half of the last sentence, "how to represent/transform the state space for the data model", actually describes exactly the problem I'm wondering about solutions for.
BCS
One possibility for the state-space representation issue is to expand on the "toolbox" concept and have a variety of different representations that are automatically tested out. This could include (1) no transformation (which might work OK for some types of discrete-valued or text data), (2) conceptual clusters or ontologies for natural language, (3) coarse-coding representations for numerical data, etc. These would also be generic schemes that offer a decent chance of dividing up the state space in a way that makes patterns easier to learn, but they lack the capabilities of more tailored approaches.
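
Extending the toolbox sketch above (still assuming scikit-learn; the candidate representations and bin count are placeholder choices), the representation itself would become part of the search:

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer
    from sklearn.tree import DecisionTreeClassifier

    # Candidate representations: (1) identity, i.e. no transformation, and
    # (3) coarse-coding of numeric data via binning. A real system would add
    # text/ontology schemes for case (2) as well.
    representations = [
        FunctionTransformer(),
        KBinsDiscretizer(n_bins=5, encode="onehot-dense"),
    ]
    classifiers = [DecisionTreeClassifier(), GaussianNB()]

    def best_pipeline(X, y):
        """Cross-validate every representation/classifier pair; keep the winner."""
        candidates = [make_pipeline(rep, clf)
                      for rep in representations for clf in classifiers]
        scored = [(cross_val_score(p, X, y, cv=5).mean(), p) for p in candidates]
        return max(scored, key=lambda pair: pair[0])[1].fit(X, y)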
Joel Hoff