views:

73

answers:

3

Hi,

I'm usually pretty adept with algorithms, but I've got a pretty abstract question here, which is probably someone's PhD project somewhere, and bordering on NP completeness. But maybe it's a more common problem than I think.

I have a list of 25000 Strings, created using a bunch of drop down lists and text fields. So, to simplify the discussion, lets say this is the, er, unidirectional graph:

{My Cat/My Dog} had {kittens, puppies}.

So, this is like a tree structure whose 4 paths represent 4 possible sentences.

How would one reverse engineer the tree structure from a (possibly incomplete) list of sentences?

i.e.

So that from
My Cat had kittens
My Cat had puppies
My Dog had kittens
, you could still recreate the original syntax tree.

Obviously with 25000 Strings, this will take a while. But is there any software out there that does this? Or, is this a common enough problem that there are known algorithms to do this?

It seems like a regex parser in nature, but I don't know where to begin. I'm dealing with this problem at work, and doing my own analysis of the sentences to parse another 500 or so, every time I find a new pattern. But I reckon if I had the tree structure, I could do it chop chop.

Any ideas? Thanks

+2  A: 

Could templatemaker be a step in the right direction for you? Its goal is to infer the templates behind similarly formatted strings, later allowing you to use this template to extract the data from other strings.

Eli Bendersky
thanks, that looks pretty good. I just realised that whatever algorithm diff uses is probably also pretty much what I am looking for.
djb
@djb: templatemaker uses a kind-of superset of diff - i.e. same longest-common-substring building blocks
Eli Bendersky
+1  A: 

But maybe it's a more common problem than I think.

I believe that this is known as grammar inference or grammar induction.

Rafał Dowgird
+2  A: 

This may come under the heading of learning Finite Automata, in which case it is genuinely a hard problem, at least with the standard assumptions of that field. However, I suspect that your case is easier than most, because you know that the machine is in a single start state at the beginning if each string.

If looking up learning Finite Automata is too depressing, you could just get hold of some code for fitting Hidden Markov Models, let it loose, and hope for the best.

mcdowella