I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.

However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.
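To make the cost model concrete, here is a minimal sketch of a weighted edit distance between two whole strings, driven by a confusion matrix. The names `EPS`, `make_cost`, and `weighted_distance` are made up for illustration, the matrix is assumed to be a plain dict keyed by letter pairs, and any pair missing from it is assumed to cost 1.0. The numbers are just the example costs from above, not real phonetic data.

```python
EPS = ""  # the empty string stands for "no letter" (insertion or deletion)

def make_cost(confusion, default=1.0):
    """Wrap a {(x, y): cost} dict; identical letters cost 0, unknown pairs cost `default`."""
    def cost(x, y):
        if x == y:
            return 0.0
        return confusion.get((x, y), default)
    return cost

def weighted_distance(source, target, cost):
    """Cheapest way to turn `source` into `target` using per-letter costs."""
    # prev[j] = cost of turning the source prefix seen so far into target[:j]
    prev = [0.0]
    for t in target:
        prev.append(prev[-1] + cost(EPS, t))           # insert every target letter
    for s in source:
        cur = [prev[0] + cost(s, EPS)]                 # delete every source letter
        for j, t in enumerate(target, start=1):
            cur.append(min(
                prev[j - 1] + cost(s, t),              # substitute (or match)
                prev[j] + cost(s, EPS),                # delete s
                cur[j - 1] + cost(EPS, t),             # insert t
            ))
        prev = cur
    return prev[-1]

# Hypothetical costs taken from the example in the text.
confusion = {("n", "m"): 0.1, ("m", "n"): 0.1, ("n", "k"): 0.7, ("k", "n"): 0.7}
cost = make_cost(confusion)

print(weighted_distance("mein", "meim", cost))  # 0.1
print(weighted_distance("mein", "meik", cost))  # 0.7
```

This only compares two complete strings; it does not yet locate a word inside a longer stream.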

I know that there are tools available that do approximate matching, such as agrep, but as far as I can tell they do not take a confusion matrix as input. That is, every insertion, substitution, or deletion costs exactly 1. My question is: are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?

EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.
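For reference, one standard way to get this behaviour is the approximate substring variant of the edit-distance dynamic program (usually attributed to Sellers): the empty pattern prefix is allowed to match at any position in the stream for free, so a match can start anywhere. Below is a rough sketch under assumptions, not a tested tool. The function names, the dict-based confusion matrix, the default cost of 1.0 for unlisted pairs, and the threshold of 0.75 are all made up for illustration.

```python
EPS = ""  # the empty string stands for "no letter" (insertion or deletion)

def make_cost(confusion, default=1.0):
    """Wrap a {(x, y): cost} dict; identical letters cost 0, unknown pairs cost `default`."""
    def cost(x, y):
        return 0.0 if x == y else confusion.get((x, y), default)
    return cost

def find_approx(pattern, text, cost, threshold):
    """Return (start, end, substring, cost) for each end position in `text`
    where `pattern` matches some substring with total cost <= threshold."""
    m = len(pattern)
    # col[i] = (cost, start) of the cheapest alignment of pattern[:i] with a
    # substring of text that ends at the current position and begins at `start`.
    col = [(0.0, 0)]
    for i in range(1, m + 1):
        col.append((col[i - 1][0] + cost(pattern[i - 1], EPS), 0))
    hits = []
    for j, t in enumerate(text, start=1):
        new = [(0.0, j)]  # a match may begin at any position at no cost
        for i in range(1, m + 1):
            p = pattern[i - 1]
            new.append(min(
                (col[i - 1][0] + cost(p, t), col[i - 1][1]),    # substitute or match
                (col[i][0] + cost(EPS, t), col[i][1]),          # extra letter t in the stream
                (new[i - 1][0] + cost(p, EPS), new[i - 1][1]),  # pattern letter p missing
            ))
        col = new
        c, start = col[m]
        if c <= threshold:
            hits.append((start, j, text[start:j], c))
    return hits

# Hypothetical costs taken from the example in the text.
confusion = {("n", "m"): 0.1, ("m", "n"): 0.1, ("n", "k"): 0.7, ("k", "n"): 0.7}
hits = find_approx("mein", "aiammeinlimeiking", make_cost(confusion), threshold=0.75)
print(hits)  # [(4, 8, 'mein', 0.0), (10, 14, 'meik', 0.7)]
```

The running time is proportional to len(pattern) * len(text). With a looser threshold the same word can be reported at several neighbouring end positions, so some post-filtering of overlapping hits would probably be needed.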

A: 

I'm not aware of any phonetic recognizers that use confusion matrices. The phonetic matching algorithms I know of are Soundex and the match rating approach.

I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.

bitc
Thanks for the response. Maybe I didn't explain it well, but I need to pick near-match strings out of a much longer string, e.g. [mein] out of [aiammeinlimeiking...], where I'm trying to extract close matches such as [mein] and [meik], with scores of 0.0 (exact match) and 0.7 respectively. I'm not just comparing two strings and calculating their difference, so I'm not sure Soundex and the other algorithms would help. If I'm wrong, do let me know.