views:

40

answers:

2

I've got a large set of captured data (potentially hundreds of thousands of records), and I need to be able to break it down so I can both classify it and also produce "typical" data myself. Let me explain further...

If I have the following strings of data:

132T339G1P112S
164T897F5A498S
144T989B9B223T
155T928X9Z554T
...

you might start to infer the following:

  • possibly all strings are 14 characters long
  • the 4th, 8th, 10th and 14th characters may always be alphas, while the rest are numeric
  • the first character may always be a '1'
  • the 4th character may always be the letter 'T'
  • the 14th character may be limited to only being 'S' or 'T'
  • and so on...

As you get more and more samples of real data, some of these "rules" might disappear; if you see a 15 character long string, then you have evidence that the 1st "rule" is incorrect. However, given a sufficiently large sample of strings that are exactly 14 characters long, you can start to assume that "all strings are 14 characters long" and assign a numeric figure to your degree of confidence (with an appropriate set of assumptions around the fact that you're seeing a suitably random set of all possible captured data).

As you can probably tell, a human can do a lot of this classification by eye, but I'm not aware of libraries or algorithms that would allow a computer to do it.

Given a set of captured data (significantly more complex than the above...), are there libraries that I can apply in my code to do this sort of classification for me, that will identify "rules" with a given degree of confidence?

As a next step, I need to be able to take those rules, and use them to create my own data that conforms to these rules. I assume this is a significantly easier step than the classification, but I've never had to perform a task like this before so I'm really not sure how complex it is.

At a guess, Python or Java (or possibly Perl or R) are possibly the "common" languages most likely to have these sorts of libraries, and maybe some of the bioinformatic libraries do this sort of thing. I really don't care which language I have to use; I need to solve the problem in whatever way I can.

Any sort of pointer to information would be very useful. As you can probably tell, I'm struggling to describe this problem clearly, and there may be a set of appropriate keywords I can plug into Google that will point me towards the solution.

Thanks in advance

A: 

For starters, you can't really expect to get the computer to identify arbitrarily complicated rules. Same is true of a human analyzing the strings; I'm sure you can think of some examples of rules that could apply but that no human could ever be expected to figure out just from looking at the strings.

What I think you would need to do is program the computer with certain kinds of rules that it can identify. For example, you could write a script that identifies rules of the form "The string length is always X." Or even "The Nth character is always X" wouldn't be too hard. I notice that the example rules you mentioned are all of this form, so it wouldn't be too far off from a human analysis ;-) In fact, if you know, or can assume, that the choice of the character that appears in a given position is based only on the positional index, you could use your data to estimate the probability that a given character appears in a given spot, which would be like a more general version of "The Nth character is always X."

If you want to establish a confidence level for your rules, I'd suggest looking into Bayesian statistics, which is used when you want to revise the probability of a hypothesis (such as "this rule is correct") as you collect new evidence.

David Zaslavsky
Thanks for your response. If there really is no better option than to construct what amounts to a big bunch of "if" statements with explicit parameters, then I'll mark your answer as accepted. However, I'm inclined to think there's probably something in e.g. Python's bioinformatics or NLTK libraries that could be a good fit - I just don't know enough about these fields to be able to construct a suitable question
monch1962
You're right, there may be something better than a list of "if" statements or equivalent, but I doubt that you're going to find something _much_ better. This is getting into the realm of artificial intelligence - not that I'm an expert on that or anything, but I know AI development is still fairly primitive.
David Zaslavsky
A: 

Try Weka which has clustering algorithms. Clustering algorithms find patterns in data without supervision. Weka also has incremental clusterers. Exactly what you want, I think.

And it's Java.

Allen
I should add that your problem can be described as a clustering problem.
Allen