ansaurus

Question

Classifying captured data in unknown format?

Answer 1

A:

For starters, you can't really expect the computer to identify arbitrarily complicated rules. The same is true of a human analyzing the strings; I'm sure you can think of rules that could apply but that no human could ever be expected to figure out just from looking at the strings.

What I think you would need to do is program the computer with certain kinds of rules that it can identify. For example, you could write a script that identifies rules of the form "The string length is always X." Even "The Nth character is always X" wouldn't be too hard. I notice that the example rules you mentioned are all of this form, so the result wouldn't be too far off from a human analysis ;-) In fact, if you know, or can assume, that the character appearing in a given position depends only on the positional index, you could use your data to estimate the probability that a given character appears in a given spot, which would be a more general version of "The Nth character is always X."
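To make that concrete, here is a minimal sketch in Python of a script that checks for the two rule types above and also estimates per-position character frequencies. The function names and the sample strings are made up for illustration, and it assumes each captured record is already available as a string.

```python
from collections import Counter


def infer_rules(samples):
    """Return the simple rules that hold for every sample string."""
    if not samples:
        raise ValueError("need at least one sample")
    rules = {}
    # Rule type 1: "The string length is always X."
    lengths = {len(s) for s in samples}
    if len(lengths) == 1:
        rules["length"] = lengths.pop()
    # Rule type 2: "The Nth character is always X."
    # Only positions present in every string can be compared.
    shortest = min(len(s) for s in samples)
    fixed = {}
    for i in range(shortest):
        chars = {s[i] for s in samples}
        if len(chars) == 1:
            fixed[i] = chars.pop()
    rules["fixed_chars"] = fixed
    return rules


def position_frequencies(samples):
    """Estimate P(character | position) as relative frequencies."""
    longest = max(len(s) for s in samples)
    freqs = []
    for i in range(longest):
        counts = Counter(s[i] for s in samples if len(s) > i)
        total = sum(counts.values())
        freqs.append({ch: n / total for ch, n in counts.items()})
    return freqs


# Hypothetical captured strings, invented for this example.
samples = ["AB-1234", "AB-9876", "AC-5555"]
rules = infer_rules(samples)
freqs = position_frequencies(samples)
print(rules)
print(freqs[1])
```

Positions here are 0-based. A frequency of 1.0 at some position is the same thing as a "The Nth character is always X" rule, which is why the frequency table is the more general version.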

If you want to establish a confidence level for your rules, I'd suggest looking into Bayesian statistics, which is what you use when you want to revise the probability of a hypothesis (such as "this rule is correct") as you collect new evidence.
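As a rough illustration of that kind of update, the sketch below applies Bayes' rule to the hypothesis "this position always holds a given character". The competing hypothesis (the character is drawn uniformly from an alphabet of a fixed size), the 50/50 prior, and the function name are all assumptions of this example, not something your data dictates.

```python
def update_rule_probability(prior, observed_char, expected_char, alphabet_size):
    """One Bayes update for H1: 'this position always holds expected_char'.

    The alternative H0 (an assumption of this sketch) is that the character
    is drawn uniformly from alphabet_size symbols. prior must lie strictly
    between 0 and 1.
    """
    like_h1 = 1.0 if observed_char == expected_char else 0.0
    like_h0 = 1.0 / alphabet_size
    evidence = like_h1 * prior + like_h0 * (1.0 - prior)
    return like_h1 * prior / evidence


belief = 0.5  # assumed prior: no idea whether the rule holds
history = []
for ch in "AAAA":  # made-up observations at one position
    belief = update_rule_probability(belief, ch, "A", alphabet_size=26)
    history.append(belief)
print(history)
```

Each matching observation pushes the probability toward 1, and a single mismatch drops it to 0, because an "always" rule cannot survive a counterexample. If your capture can be noisy, you would want a softer likelihood for H1 than 0 or 1.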

David Zaslavsky 2010-05-19 07:34:35

Thanks for your response. If there really is no better option than to construct what amounts to a big bunch of "if" statements with explicit parameters, then I'll mark your answer as accepted. However, I'm inclined to think there's probably something in, for example, Python's bioinformatics or NLTK libraries that could be a good fit. I just don't know enough about these fields to be able to construct a suitable question.

monch1962 2010-05-19 11:44:48

You're right, there may be something better than a list of "if" statements or equivalent, but I doubt that you're going to find something _much_ better. This is getting into the realm of artificial intelligence - not that I'm an expert on that or anything, but I know AI development is still fairly primitive.

David Zaslavsky 2010-05-19 23:45:04

Answer 2

A:

Try Weka, which has clustering algorithms. Clustering algorithms find patterns in data without supervision. Weka also has incremental clusterers, which can be updated as new data arrives. That's exactly what you want, I think.

And it's Java.
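To show the idea of incremental clustering without Weka, here is a small Python sketch using only the standard library. It is not one of Weka's algorithms: it is a simple "leader" scheme in which each new string joins the cluster whose first member it resembles most, or starts a new cluster. The class name, the 0.5 threshold, and the sample strings are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a, b).ratio()


class LeaderClusterer:
    """Incremental clustering: items are assigned one at a time."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.clusters = []  # each cluster is a list; its first item is the leader

    def add(self, item):
        """Assign item to the most similar cluster and return its index."""
        best_index, best_score = None, 0.0
        for index, members in enumerate(self.clusters):
            score = similarity(item, members[0])
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= self.threshold:
            self.clusters[best_index].append(item)
            return best_index
        # Nothing is similar enough, so the item leads a new cluster.
        self.clusters.append([item])
        return len(self.clusters) - 1


clusterer = LeaderClusterer(threshold=0.5)
# Made-up captured strings in two different formats.
stream = ["AB-1234", "2010-05-19", "AB-1299", "2010-05-23"]
labels = [clusterer.add(s) for s in stream]
print(labels)
print(clusterer.clusters)
```

The result depends on the order of arrival and on the threshold, so treat it as a way to get a first grouping of formats rather than a final answer.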

Allen 2010-05-23 19:34:19

I should add that your problem can be described as a clustering problem.

Allen 2010-05-24 19:41:38
