ansaurus

Question

.NET String parsing performance improvement - Possible Code Smell

Answer 1

+3 A:

I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.

private string RemoveNonEssentialWords(string phrase)
{
    return Regex.Replace(phrase, // input
                         @"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
                         "", // replacement
                         RegexOptions.IgnoreCase)
           .Replace("  ", " ");
}

The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.

Gabe 2010-03-09 17:53:01

Though you'd have to dynamically build the regular expression based on the word list, since it's a parameter to the function in the production version, and not a constant array.

qstarin 2010-03-09 17:57:18

Answer 2

+1 A:

Yeah, that smells.

I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).

For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)

qstarin 2010-03-09 17:53:46

Answer 3

+5 A:

I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements a bit more detailed. I would probably use Replace or regular expressions.

Achim 2010-03-09 17:54:24

I was just going to point that out, but using 'lefty' to 'y'.

Mark 2010-03-09 18:00:17

I picked up on the code flaw as well. Rather than working on the phrase as a single string, it'll need to be manipulated as a set of words.

Ben Griswold 2010-03-09 18:00:52

Also if "lefacutet" appears in the phrase, it will remove "acute" and leave "left", even though "left" is a nonessential.

Brian 2010-03-09 18:21:56

Or "smallpox in minority population" becomes "pox in ity population". Get the algorithm *correct* first and then worry about matters of style.

Eric Lippert 2010-03-09 18:36:42

Excellent points all. As I said, I didn't write the code and it was a bit difficult to read, so I didn't notice the points made. Aaaronaught's tokenization suggestion handles not only the concerns I had, but these points as well.

Steve Brouillard 2010-03-09 19:29:12

Answer 4

+1 A:

I would create A Hash table of Removed words parse each word if in the hash remove it only one time through the array and I believe that creating a has table is O(n).

rerun 2010-03-09 17:54:54

Answer 5

A:

How does this look?

        foreach (string nonEssent in nonessentials)
        {
            phrase.Replace(nonEssent, String.Empty);
        }
        phrase.Replace("  ", " ");

Andrew M 2010-03-09 17:57:05

This works pretty much the same as the original code, an so still suffers from all the problems other posters have pointed out with the original code, but it's cleaner and easier to read. A parser /state machine that splits the input into words might be better overall.

Andrew M 2010-03-09 17:59:47

Answer 6

+10 A:

This appears to be an algorithm for removing stop words from a search phrase.

Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?

One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:

public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
    var tokens = Tokenize(phrase);
    var filteredTokens = tokens.Where(s => !stop.Contains(s));
    return string.Join(" ", filteredTokens.ToArray());
}

public static IEnumerable<string> Tokenize(string phrase)
{
    return string.Split(phrase, ' ');
    // Or use a regex, such as:
    //    return Regex.Split(phrase, @"\W+");
}

This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.

Aaronaught 2010-03-09 18:04:07

I also like splitting the input into separate words. Will also be useful to apply future logic on each search term. e.g. Spelling check

Philip Fourie 2010-03-09 18:12:09

If performance needs to be maximized, it's worth saying that this method still contains a lot of inefficiency. Tokenizing the input string will create, obviously, as many separate strings as there are words in the input string. Also, creating the array and re-joining the words could take some time if the input is large.

qstarin 2010-03-09 18:32:06

@qstarin: Considering that a search phrase is unlikely to be more than, oh, about 10 words or so, I doubt it's going to pose a significant problem. Pass a `HashSet<string>` for the `stop` argument and this becomes O(N) over the number of words; worrying about performance beyond this point becomes premature optimization IMO. Aim for clean, readable code that performs *reasonably* well first; then, if that's not good enough, you can start making micro-optimizations.

Aaronaught 2010-03-09 18:58:14

@Philip Fourie: Indeed, and we haven't even talked about word derivatives, (partial) homonyms and misspellings. A good stop word algorithm would probably remove *totally* as well as *total*, and also remove *exessive* as an obvious misspelling of *excessive*. These substitutions become more and more difficult to do with any search-and-replace approach.

Aaronaught 2010-03-09 19:02:17

Aaronaught. Thanks, I think this will take me where I need to go. qstarin has a point regarding the creation of strings, however, the stop word list is limited in size and scope and so, ultimately, is the search phrase. Search phrases will average less than 4 words and very seldom exceed 5 or 6. The stop list is relatively static and currently contains less than 30 words.Thanks.

Steve Brouillard 2010-03-09 19:03:05

Make sure you eliminate any punctuation, though, if you're going to be splitting on space.

Gabe 2010-03-09 19:09:48

@aaronaught: Entirely agree. Which is why the first word I said was "If."

qstarin 2010-03-10 18:11:59

Answer 7

A:

If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:

Start by creating a Regex from the array input. Something like:

var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";

That will result in something like:

\b(left|right|chronic)\b

Then create a Regex object to do the find/replace:

System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);

Then you can just do a Replace like so:

string fixedPhrase = regex.Replace(phrase, "");

statichippo 2010-03-09 18:13:07

ansaurus

tags:

views:

answers:

.NET String parsing performance improvement - Possible Code Smell

related questions