tags:

views:

103

answers:

2

I am working on an augmentative and alternative communication (AAC) program. My current goal is to store a history of input/spoken text and search for common phrase fragments or word n-grams. I am currently using an implementation based on the lzw compression algorithm as discussed at CodeProject - N-gram and Fast Pattern Extraction Algorithm. This approach although producing n-grams does not behave as needed.

Let's say for example that I enter "over the mountain and through the woods" several times. My desired output would be the entire phrase "over the mountain and through the woods". Using my current implementation the phrase is broken into trigrams and on each repeated entry one word is added. So on the first entry I get "over the mountain". On the second entry "over the mountain and", etc.

Let's assume we have the following text:

this is a test
this is another test
this is also a test
the test of the emergency broadcasting system interrupted my favorite song

My goal would be that if "this is a test of the emergency broadcasting system" were entered next that I could use that within a regex to return "this is a test" and "test of the emergency broadcasting system". Is this something that is possible through regex or am I'm walking the wrong path? I appreciate any help.

A: 

I have been unable to find a way to do what I need with regular expressions alone although the technique shown at Matching parts of a string when the string contains part of a regex pattern comes close.

I ended up using a combination of my initial system along with some regex as shown below.

flow chart

This parses the transcript of the first presidential debate (about 16,500 words) in about 30 seconds which for my purposes is quite fast.

Jeff
A: 

From your use case it appears you do not want fixed-length n-gram matches, but rather a longest sequence of n-gram match. Just saw your answer to your own post, which confirms ;)

Dom Brezinski