Hi all,

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have five features to consider; the system is basically a set of rules:

A  B  C  D  E  Result
1  2  b  5  3  X
1  2  c  5  4  X
1  2  e  5  2  X

We take a subject, get its values for A-E, and then try the rules in sequence; the first rule that matches determines the result.

C is a discrete value, which could be any of a-e. The rest are just integers.
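
For completeness, the matching step itself is trivial; here's a minimal sketch of the first-match lookup (Java purely for illustration, and the Rule record is a hypothetical stand-in for our real representation):

    import java.util.List;

    record Rule(int a, int b, char c, int d, int e, String result) {
        boolean matches(int a, int b, char c, int d, int e) {
            return this.a == a && this.b == b && this.c == c
                && this.d == d && this.e == e;
        }
    }

    class Matcher {
        // Try the rules in order; the first rule that matches wins.
        static String classify(List<Rule> rules,
                               int a, int b, char c, int d, int e) {
            for (Rule r : rules) {
                if (r.matches(a, b, c, d, e)) {
                    return r.result();
                }
            }
            return null; // no rule matched
        }
    }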

The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.

result("X") if $A >= 1 && $A <= 10 && $C eq 'A';

As you can see, the old rules often ignore some features entirely, or accept ranges of values. Some are more annoying:

result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);

The ruleset needs to be much smaller because it has to be maintained by humans, so I'd like to shrink the rule sets so that the first example above would become:

A  B  C    D  E    Result
1  2  bce  5  2-4  X

The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:

A  B  C   Result
1  2  a   X
1  2  b   X
(repeat a few hundred times)
2  4  a   X  
2  4  b   X
(ditto)

In an ideal world, this would be two rules:

A  B  C  Result
1  2  *  X
2  4  *  X

That is, the algorithm would not only identify the relationship between A and B, but also deduce that C is noise (not important to the rule).
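
For this simplest case, I can imagine detecting the don't-care mechanically: group the rules on every feature except C, and if the observed C values for a group cover the whole domain, collapse the group to a wildcard. A rough sketch (Java again just for illustration; the types are hypothetical):

    import java.util.*;

    class DontCare {
        static final Set<Character> C_DOMAIN = Set.of('a', 'b', 'c', 'd', 'e');

        // A rule from the three-feature example above, and the same rule
        // with C abstracted away.
        record Rule(int a, int b, char c, String result) {}
        record Key(int a, int b, String result) {}

        static List<String> collapse(List<Rule> rules) {
            // Collect the C values observed for each (A, B, Result) group.
            Map<Key, Set<Character>> seen = new HashMap<>();
            for (Rule r : rules) {
                seen.computeIfAbsent(new Key(r.a(), r.b(), r.result()),
                                     k -> new HashSet<>()).add(r.c());
            }
            List<String> out = new ArrayList<>();
            for (Map.Entry<Key, Set<Character>> e : seen.entrySet()) {
                Key k = e.getKey();
                // If every possible C value appears, C carries no information.
                String c = e.getValue().equals(C_DOMAIN)
                         ? "*" : e.getValue().toString();
                out.add(k.a() + "  " + k.b() + "  " + c + "  " + k.result());
            }
            return out;
        }
    }

But that only handles exact-valued rules and a single candidate feature; it doesn't generalize to ranges or to relationships between features.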

Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.

A: 

You could try a neural network approach, trained via backpropagation, assuming you have, or can randomly generate from the old ruleset, a large set of data that hits all your classes. A hidden layer of appropriate size will let you approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but thanks to the training paradigm it should have no issue with your discrete inputs.
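
One concrete way to try this is with Weka's MultilayerPerceptron; a minimal sketch (the ARFF file name and the network settings are placeholders, and you'd first dump your generated data to ARFF):

    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NetDemo {
        public static void main(String[] args) throws Exception {
            // Data generated from the old ruleset: numeric A/B/D/E,
            // nominal C, nominal Result as the last attribute.
            Instances data = DataSource.read("generated.arff");
            data.setClassIndex(data.numAttributes() - 1);

            MultilayerPerceptron net = new MultilayerPerceptron();
            net.setHiddenLayers("10");   // one hidden layer of 10 units
            net.setTrainingTime(500);    // backpropagation epochs
            net.buildClassifier(data);

            System.out.println(net);
        }
    }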

This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, since this is a one-off process, you can get an arbitrary degree of confidence by checking against a gargantuan validation set).

ezod
Unfortunately we need to be able to introspect the exact rules, although your idea would be excellent for many other use cases.
rjh
+1  A: 

Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in a practical amount of time? If so, you could begin by separating the rules into groups by result.

Then, for each result group, do the following. Treating each feature as a dimension, and the allowed values of a feature as the positions along that dimension, construct a huge Karnaugh map representing the entire rule set.

The map has two uses. One: research automated implementations of the Quine-McCluskey algorithm. A lot of work has been done in this area, and there are even a few programs available, although probably none of them will handle a Karnaugh map of the size you're going to build.

Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
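
A sketch of that verification pass (the Classifier interface and the value ranges are stand-ins for your actual representation):

    import java.util.Objects;

    class Verify {
        // A classifier maps a feature vector to a result label (null = no match).
        interface Classifier {
            String classify(int a, int b, char c, int d, int e);
        }

        // Placeholder domains; substitute the real ranges for A, B, D, E.
        static final int MIN = 1, MAX = 10;
        static final char[] C_DOMAIN = {'a', 'b', 'c', 'd', 'e'};

        static boolean equivalent(Classifier original, Classifier reduced) {
            for (int a = MIN; a <= MAX; a++)
              for (int b = MIN; b <= MAX; b++)
                for (char c : C_DOMAIN)
                  for (int d = MIN; d <= MAX; d++)
                    for (int e = MIN; e <= MAX; e++)
                        if (!Objects.equals(
                                original.classify(a, b, c, d, e),
                                reduced.classify(a, b, c, d, e)))
                            return false; // the two maps differ at this cell
            return true; // every cell agrees, so the rule sets are equivalent
        }
    }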

-Al.

A. I. Breveleri
+1  A: 

Check out the Weka machine learning library for Java. The API is a little crufty, but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise). You could try a decision tree, such as J48, since these are usually easy to visualize and interpret.
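
For instance, a minimal J48 run looks something like this (the ARFF file name is a placeholder; you'd export your rule table to ARFF first, with Result as the last attribute):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TreeDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("rules.arff"); // A-E plus Result
            data.setClassIndex(data.numAttributes() - 1);   // Result is the class

            J48 tree = new J48();
            tree.setUnpruned(false);    // a pruned tree is easier to read
            tree.buildClassifier(data);

            System.out.println(tree);   // prints the tree in human-readable form
        }
    }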

dsimcha
Accepting - I have implemented a simple classification algorithm which takes advantage of relationships and implications that I discovered by using Weka. Thanks.
rjh