n-grams using regex | ansaurus

tags:

c#
regex

views:

103

answers:

2

+2 Q:

n-grams using regex

I am working on an augmentative and alternative communication (AAC) program. My current goal is to store a history of input/spoken text and search for common phrase fragments or word n-grams. I am currently using an implementation based on the lzw compression algorithm as discussed at CodeProject - N-gram and Fast Pattern Extraction Algorithm. This approach although producing n-grams does not behave as needed.

Let's say for example that I enter "over the mountain and through the woods" several times. My desired output would be the entire phrase "over the mountain and through the woods". Using my current implementation the phrase is broken into trigrams and on each repeated entry one word is added. So on the first entry I get "over the mountain". On the second entry "over the mountain and", etc.

Let's assume we have the following text:

this is a test
this is another test
this is also a test
the test of the emergency broadcasting system interrupted my favorite song

My goal would be that if "this is a test of the emergency broadcasting system" were entered next that I could use that within a regex to return "this is a test" and "test of the emergency broadcasting system". Is this something that is possible through regex or am I'm walking the wrong path? I appreciate any help.

A:

I have been unable to find a way to do what I need with regular expressions alone although the technique shown at Matching parts of a string when the string contains part of a regex pattern comes close.

I ended up using a combination of my initial system along with some regex as shown below.

flow chart

This parses the transcript of the first presidential debate (about 16,500 words) in about 30 seconds which for my purposes is quite fast.

Jeff 2010-01-25 21:24:44

A:

From your use case it appears you do not want fixed-length n-gram matches, but rather a longest sequence of n-gram match. Just saw your answer to your own post, which confirms ;)

Dom Brezinski 2010-01-25 21:32:34

related questions

Displaying Flash content in a C# WinForms application

How to get the value of built, encoded ViewState?

Unhandled Exception Handler in .NET 1.1

How do I connect to a database and loop over a recordset in C#?

How do I most elegantly express left join with aggregate SQL as LINQ query

Get a new object instance from a Type in C#

.NET Testing Framework Advice

Automatically update version number

What is the difference between an int and an Integer in Java/C#?

How to write to Web.Config in Medium Trust ?

WinForms ComboBox data binding gotcha

How do you sort a C# dictionary by value?

Adding Scripting functionality to .NET applications

Floating Point Number parsing: Is there a Catch All algorithm?

How do I print an HTML document from a web service?

Decoding T-SQL CAST in C#/VB.net

Anatomy of a "Memory Leak"

How do I get a distinct, ordered list of names from a DataTable using Linq

Reliable Timer in a Console Application

How do I fill a DataSet or a DataTable from a LINQ query resultset ?

What's the difference between Math.Floor() and Math.Truncate() in .NET?

How do I calculate relative time?

How do I calculate someone's age in C#?

Are there any conversion tools for porting Visual J# code to C#?

When setting a form's opacity should I use a decimal or double?