views:

74

answers:

3

I am writing a program in C# that compares strings similarly to the way that Google searches documents for keywords.

I am wanting a search for "stack overflow" to return true for "stack overflow" (plain), "This is the stack overflow." (in the middle), "Welcome to Stack Overflow." (case insensitive), "I like stack overflow." (variable whitespace), and "Who puts a dash in stack-overflow?", but not "stackoverflow" (no whitespace).

I was thinking that I could use a regular expression like "stack([ -]|. )+overflow", it seems overkill to have to replace every space in each keyword with a character set for each new keyword. Because "stack overflow" is not the only string I am searching, I have to do it pragmatically.

+1  A: 

If you simply want to achieve the effect in the specific case you mentioned, you can use regular expressions to replace the tokens you want to ignore by a single space (or empty string).

If you want a more elaborate solution, you could use dynamic programming to get the smallest permutation required to transform the first string into the second. This will allow matching with (few) missing letters or typos, too.

André Caron
I am looking for more than just the specific "stack overflow" case as I will have hundreds of keywords to compare. For the time being, I will replace all spaces with the regex character class, but I'm hoping something else will arise.
Benjamin Manns
I meant the specific "ignore case, whitespace and punction" case. If you want to handle possible matches in the case of typos, then you'll be looking for something more evolved, such as dynamic programming. If you compare the input string with all possible matches, it will return a match cost for each string and you can select the keyword with the lowest cost as a "match" with a certain degree of uncertainty (high match cost -> unlikely match).Nothing prevents you from using a hybrid approach: replace case, whitespace, punctuation, and then use dynamic programming.
André Caron
A: 

If you are comparing against short strings, then the easiest way that I can see would be to strip out all of the white space and other characters from both strings and do a simple string.Contains.

Kragen
+1  A: 

To meet your specification, you could first do

newSearchString = Regex.Replace(Regex.Escape(searchString), @"\s+", @"[\s\p{P}]+");

(to transform your plain text search string into a regular expression that also allows punctuation in places where there used to be only whitespace), and then apply that regex to whatever text you're searching.

But of course this will fail to match on the slightest typo whereas an algorithm using Levensthein distance will also match "Stak Overfloor".

Tim Pietzcker