N-Grams for sequence correction

Sorry for the difficult question.

I have a large set of sequences to be corrected by either/or adding digits or replacing them (never removing anything) that looks like this:

1,2,,3 => 1,7,4,3
4,,5,6 => 4,4,5,6
4,7,8,9 => 4,7,8,9,1
4,7 => 4,8
4,7,1 => 4,7,2

It starts with a padded original sequence, and a sample correction.

I'd like to be able to work on correcting the sequences automatically by calculating the frequencies of the different n-grams being corrected, the first sample would become

1=>1
2=>7
3=>3
1,2=>1,7
2,3=>7,4,3
1,2,3=>1,7,4,3

I'd collect the frequency of these n-grams corrections, and I'm looking for a way to calculate the best way to correct a new input that may or may not be in the sample data.

This seems to be similar to SMT.

Assign known replacements a score, based on the length of the replacement and the number of occurrences. Naively, I would suggest making this score proportional to the square of the length (longer matches being rarer, in most scenarios I can think of) and the square root of the number of occurrences, such that a 4-item sequence has as much weight as a 2-item sequence that occurs 16 times as often. This would need to be adjusted based on your actual situation.

Given a sequence of length M, there are N substrings of lengths 1 to M, where N=M*(M+1)/2, so if the strings are reasonably short then you could iterate over every substring and look up possible replacements. The number of ways to compose the whole string out of these substrings is also proportional to M^2, I think.

For every possible composition of the original string by substrings, add up the total score of the best (highest score) replacement for each substring.

The composition with the highest total score will be (potentially, given my assumptions about the process) the "best" post-replacement result.

ansaurus

tags:

views:

answers:

N-Grams for sequence correction

related questions