tags:

views:

216

answers:

7

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.

I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.

Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.

I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.

So, here's the code:

private string RemoveNonEssentialWords(string phrase)
{
    //This array is being created manually for demo purposes.  In production code it's passed in from elsewhere.
    string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive", 
                                    "upper", "lower", "complete", "partial", "subacute", "severe",
                                    "moderate", "total", "small", "large", "minor", "multiple", "early",
                                    "major", "bilateral", "progressive"};
    int index = -1;

    for (int i = 0; i < nonessentials.Length; i++)
    {
        index = phrase.ToLower().IndexOf(nonessentials[i]);
        while (index >= 0)
        {
            phrase = phrase.Remove(index, nonessentials[i].Length);
            phrase = phrase.Trim().Replace("  ", " ");
            index = phrase.IndexOf(nonessentials[i]);
        }
    }

    return phrase;
}

Thanks in advance for your help.

Cheers,

Steve

+3  A: 

I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.

private string RemoveNonEssentialWords(string phrase)
{
    return Regex.Replace(phrase, // input
                         @"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
                         "", // replacement
                         RegexOptions.IgnoreCase)
           .Replace("  ", " ");
}

The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.

Gabe
Though you'd have to dynamically build the regular expression based on the word list, since it's a parameter to the function in the production version, and not a constant array.
qstarin
+1  A: 

Yeah, that smells.

I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).

For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)

qstarin
+5  A: 

I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements a bit more detailed. I would probably use Replace or regular expressions.

Achim
I was just going to point that out, but using 'lefty' to 'y'.
Mark
I picked up on the code flaw as well. Rather than working on the phrase as a single string, it'll need to be manipulated as a set of words.
Ben Griswold
Also if "lefacutet" appears in the phrase, it will remove "acute" and leave "left", even though "left" is a nonessential.
Brian
Or "smallpox in minority population" becomes "pox in ity population". Get the algorithm *correct* first and then worry about matters of style.
Eric Lippert
Excellent points all. As I said, I didn't write the code and it was a bit difficult to read, so I didn't notice the points made. Aaaronaught's tokenization suggestion handles not only the concerns I had, but these points as well.
Steve Brouillard
+1  A: 

I would create A Hash table of Removed words parse each word if in the hash remove it only one time through the array and I believe that creating a has table is O(n).

rerun
A: 

How does this look?

        foreach (string nonEssent in nonessentials)
        {
            phrase.Replace(nonEssent, String.Empty);
        }
        phrase.Replace("  ", " ");
Andrew M
This works pretty much the same as the original code, an so still suffers from all the problems other posters have pointed out with the original code, but it's cleaner and easier to read. A parser /state machine that splits the input into words might be better overall.
Andrew M
+10  A: 

This appears to be an algorithm for removing stop words from a search phrase.

Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?

One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:

public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
    var tokens = Tokenize(phrase);
    var filteredTokens = tokens.Where(s => !stop.Contains(s));
    return string.Join(" ", filteredTokens.ToArray());
}

public static IEnumerable<string> Tokenize(string phrase)
{
    return string.Split(phrase, ' ');
    // Or use a regex, such as:
    //    return Regex.Split(phrase, @"\W+");
}

This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.

Aaronaught
I also like splitting the input into separate words. Will also be useful to apply future logic on each search term. e.g. Spelling check
Philip Fourie
If performance needs to be maximized, it's worth saying that this method still contains a lot of inefficiency. Tokenizing the input string will create, obviously, as many separate strings as there are words in the input string. Also, creating the array and re-joining the words could take some time if the input is large.
qstarin
@qstarin: Considering that a search phrase is unlikely to be more than, oh, about 10 words or so, I doubt it's going to pose a significant problem. Pass a `HashSet<string>` for the `stop` argument and this becomes O(N) over the number of words; worrying about performance beyond this point becomes premature optimization IMO. Aim for clean, readable code that performs *reasonably* well first; then, if that's not good enough, you can start making micro-optimizations.
Aaronaught
@Philip Fourie: Indeed, and we haven't even talked about word derivatives, (partial) homonyms and misspellings. A good stop word algorithm would probably remove *totally* as well as *total*, and also remove *exessive* as an obvious misspelling of *excessive*. These substitutions become more and more difficult to do with any search-and-replace approach.
Aaronaught
Aaronaught. Thanks, I think this will take me where I need to go. qstarin has a point regarding the creation of strings, however, the stop word list is limited in size and scope and so, ultimately, is the search phrase. Search phrases will average less than 4 words and very seldom exceed 5 or 6. The stop list is relatively static and currently contains less than 30 words.Thanks.
Steve Brouillard
Make sure you eliminate any punctuation, though, if you're going to be splitting on space.
Gabe
@aaronaught: Entirely agree. Which is why the first word I said was "If."
qstarin
A: 

If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:

Start by creating a Regex from the array input. Something like:

var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";

That will result in something like:

\b(left|right|chronic)\b

Then create a Regex object to do the find/replace:

System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);

Then you can just do a Replace like so:

string fixedPhrase = regex.Replace(phrase, "");
statichippo