Hi all,

I want to add a keywords meta tag to a page depending on the article being shown.

Let's say you load the page blabla.com/article.aspx?id=2. The article whose id equals 2 is titled "The Wisdom of Deliberate Mistakes in Business Management".

So I would like to include a meta tag like this:

<META name="keywords" content="wisdom, deliberate, mistakes, business, management" />
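
For reference, a minimal sketch of how I'd expect to emit that tag from the code-behind (assuming ASP.NET 2.0+ and a <head runat="server">; HtmlMeta lives in System.Web.UI.HtmlControls):

    ' Sketch only: add the keywords meta tag to the page header at runtime.
    Dim meta As New HtmlMeta()
    meta.Name = "keywords"
    meta.Content = "wisdom, deliberate, mistakes, business, management"
    Page.Header.Controls.Add(meta)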

So I need a way to exclude the noise words (just like SQL Server Full-Text Search does). How would you do it?

1) Save the noise word list in web.config?
2) Save the noise words in the database?
3) Save the noise words in a text file?
4) Hardcode the noise words in code (NOT =P)

Then, how would you load those noise words so the impact on page load is minimal? And finally, how would you parse the string to remove the noise words?

Thanks!

EDIT: The noise (or stop) words would be the same ones SQL Server 2005 FTS uses (see noiseENU.txt in MSSQL\FTDATA). Here is the content of that file:

about
1
after
2
all
also
3
an
4
and
5
another
6
any
7
are
8
as
9
at
0
be
$
because
been
before
being
between
both
but
by
came
can
come
could
did
do
does
each
else
for
from
get
got
has
had
he
have
her
here
him
himself
his
how
if
in
into
is
it
its
just
like
make
many
me
might
more
most
much
must
my
never
no
now
of
on
only
or
other
our
out
over
re
said
same
see
should
since
so
some
still
such
take
than
that
the
their
them
then
there
these
they
this
those
through
to
too
under
up
use
very
want
was
way
we
well
were
what
when
where
which
while
who
will
with
would
you
your
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
+1  A: 

If you want to filter "noise" or "stop" words, I would suggest looking into regular expressions; they are lightning quick at this type of thing. As for implementation, I would probably store the noise/stop words in a table, then use the words to build your regex. You should be able to cache the regex on the server, so the performance hit should be minimal.

Here is an example based on the words you provided above. There's a good online regex tester at http://regexpal.com/:

    \b(?:about|1|after|2|all|also|3|an|4|and|5|another|6|any|7|are|8|
    as|9|at|0|be|\$|because|been|before|being|between|both|but|by|came|can|come|
    could|did|do|does|each|else|for|from|get|got|has|had|he|have|her|here|him|
    himself|his|how|if|in|into|is|it|its|just|like|make|many|me|might|more|most|
    much|must|my|never|no|now|of|on|only|or|other|our|out|over|re|said|same|see|
    should|since|so|some|still|such|take|than|that|the|their|them|then|there|these|
    they|this|those|through|to|too|under|up|use|very|want|was|way|we|well|
    were|what|when|where|which|while|who|will|with|would|you|your)\b
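
A minimal sketch of how such a regex could be built from the stored word list and cached (the module and method names are just illustrative; each word is run through Regex.Escape so entries like "$" are treated literally):

    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Public Module NoiseWordFilter

        ' Compiled regex, built once and reused for every request.
        Private _noiseRegex As Regex

        Public Sub Initialize(ByVal noiseWords As IEnumerable(Of String))
            Dim escaped As New List(Of String)
            For Each w As String In noiseWords
                ' Escape each entry so tokens like "$" are matched literally.
                escaped.Add(Regex.Escape(w))
            Next
            Dim pattern As String = "\b(?:" & String.Join("|", escaped.ToArray()) & ")\b"
            _noiseRegex = New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        End Sub

        Public Function ToKeywords(ByVal title As String) As String
            ' Strip the noise words, then join the remaining words with commas.
            Dim cleaned As String = _noiseRegex.Replace(title, " ").Trim()
            Return String.Join(", ", Regex.Split(cleaned, "\W+")).ToLower()
        End Function

    End Module

With the word list loaded once (for example in Application_Start), ToKeywords("The Wisdom of Deliberate Mistakes in Business Management") should return "wisdom, deliberate, mistakes, business, management".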
Rob
+1  A: 

These kinds of words are called "Stop Words" -- that will help you google for some implementation ideas.

My sense is that there isn't much value to doing this -- the title is already considered extremely important for search indexing. Also, is "wisdom" really that relevant a word for the article?

I think the best keywords are human-chosen, like tags, and kept to at most 1 to 3 that really describe the content.

But to answer your question -- how many do you think there will be? If I were going to do this, I'd keep them in a database (if I was already using a db), and if they impact performance, pre-load them into memory (it can be shared by all sessions), as in the sketch below.
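
A minimal sketch of that pre-loading idea, assuming a StopWords table with a Word column and a connection string named "Main" (all three names are placeholders for whatever you actually use):

    Imports System.Collections.Generic
    Imports System.Configuration
    Imports System.Data.SqlClient

    ' Loaded once on first use and shared by all sessions.
    Public Class StopWordCache

        Private Shared _words As List(Of String)
        Private Shared ReadOnly _lock As New Object()

        Public Shared Function GetWords() As List(Of String)
            If _words Is Nothing Then
                SyncLock _lock
                    If _words Is Nothing Then
                        Dim list As New List(Of String)
                        Dim cs As String = _
                            ConfigurationManager.ConnectionStrings("Main").ConnectionString
                        Using cn As New SqlConnection(cs)
                            Using cmd As New SqlCommand("SELECT Word FROM StopWords", cn)
                                cn.Open()
                                Using rdr As SqlDataReader = cmd.ExecuteReader()
                                    While rdr.Read()
                                        list.Add(rdr.GetString(0))
                                    End While
                                End Using
                            End Using
                        End Using
                        _words = list
                    End If
                End SyncLock
            End If
            Return _words
        End Function

    End Class

The cached list can then feed whichever stripping approach you pick (the regex above, or a simple word-by-word filter).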

Lou Franco
+1 Pretty much what I was going to say, but I feel the value of meta tags *at all* is dubious at best these days.
annakata
Yes, I thought about tags... but the client doesn't like the idea of a human entering tags; they want to add the "tags" automatically. The first step would be just the title, then they want something really weird like reading the body of the article and deciding which are the key words. Sounds crazy and extremely hard =P But the human tags thing is not an option =( So you all agree with the idea of storing the "stop words" in the database and not anywhere else?
emzero
In answer to your question about how many "stop words": I was just thinking of using the same ones SQL Server uses (I'm going to edit the question to include the content of the file noiseENU.txt from SQL Server 2005 FTS).
emzero
A: 

You might have a look at my post Automatic generation of META tags for ASP.NET. There I make use of noise words (or stop words) in English, French, Spanish and German. For every language I have 3 arrays: standard noise words, the most common verbs, and a third with their conjugations. This way you can remove noise words along with verbs and conjugations, even irregular verbs (in languages other than English, conjugations are much more complex than the -ed, -ing and -s endings).

The sample VB project provided creates the meta title, meta keywords and meta description for every ASP.NET page (.aspx) on the fly, without user intervention, and its CPU hit occurs only at compile time (the first request). Once the page is compiled, its tags (title, keywords, description) remain with no further CPU cost. This is because the metas are calculated and injected into the file on the fly before it is actually compiled, thanks to VirtualPathProviders (the file system is not modified at any time).

I store the words in hard-coded arrays which are kept sorted so that a binary search can be used, as in the sketch below.
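
For illustration, a tiny sketch of that sorted-array-plus-binary-search lookup (the word list here is just a small sample, not the full lists used in the post):

    Imports System

    Module NoiseLookup
        ' Must stay sorted in ordinal order for Array.BinarySearch to work.
        Private ReadOnly NoiseWords As String() = _
            New String() {"a", "an", "and", "of", "the", "to"}

        Public Function IsNoiseWord(ByVal word As String) As Boolean
            Return Array.BinarySearch(NoiseWords, word.ToLowerInvariant(), _
                                      StringComparer.Ordinal) >= 0
        End Function
    End Module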

I hope this can help you in any way. Regards.

Thanks! I'm going to take a look at your post.
emzero
A: 

@Rob's answer pointed me in the right direction for a similar task. Here's the working function I ended up with. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    Public Function StripNoiseWords(ByVal s As String) As String
        ' Load the noise word list (one word per line) and collapse the
        ' whitespace into "|" to build an alternation: about|after|all|also etc.
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|")
        ' Escape the "$" entry so it is matched literally rather than as an anchor.
        NoiseWordsRegex = NoiseWordsRegex.Replace("$", "\$")
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
Herb Caudill
A: 

Here is a solution using XQuery:

    define variable $NOISE_WORDS as xs:string*
    {
      (: \b is a word boundary. This catches beginning,
         end, and middle of string matches on whole words. :)
      ('\bthe\b', '\bof\b', '\ban\b', '\bor\b',
       '\bis\b', '\bon\b', '\bbut\b', '\ba\b')
    }

    define function remove-noise-words($string, $noise)
    {
      (: This is a recursive function. :)
      if (not(empty($noise))) then
        remove-noise-words(
          replace($string, $noise[1], '', 'i'),
          (: This passes along the noise words after
             the one just evaluated. :)
          $noise[position() > 1]
        )
      else normalize-space($string)
    }

    let $source-string1 := "The Tragedy of King Lear"
    let $source-string2 := "The Tragedy OF King Lear These an"
    let $source-string3 :=
      "The Tragedy of the an of King Lear These of"
    let $source-string4 := "The of an of"
    (: Need to handle empty result if all noise words,
       as in #4 above. :)
    let $final :=
      remove-noise-words($source-string1, $NOISE_WORDS)
    return $final

Visit http://filesharepoint.com for more details.

randi