Hi all,

I want to add a keywords meta tag to a page depending on the article being shown.

Let's say you load the page blabla.com/article.aspx?id=2. The article whose id equals 2 is titled "The Wisdom of Deliberate Mistakes in Business Management".

So I would like to include a meta tag like this:

<META name="keywords" content="wisdom, deliberate, mistakes, business, management" />
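
For reference, a minimal sketch of how I'd expect to emit that tag from the code-behind (assuming ASP.NET 2.0+ and a <head runat="server">; HtmlMeta lives in System.Web.UI.HtmlControls):

    ' Sketch only: add the keywords meta tag to the page header at runtime.
    Dim meta As New HtmlMeta()
    meta.Name = "keywords"
    meta.Content = "wisdom, deliberate, mistakes, business, management"
    Page.Header.Controls.Add(meta)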

So I need a way to exclude the noise words (just like SQL Server Full-Text Search does). How would you do it?

1) Save the noise word list in web.config?
2) Save the noise words in the database?
3) Save the noise words in a text file?
4) Hardcode the noise words in code (NOT =P)

Then, how would you load those noise words so the impact on page load is minimal? And finally, how would you parse the string to remove the noise words?

Thanks!

EDIT: The noise (or stop) words would be the same ones SQL Server 2005 FTS uses (see noiseENU.txt in MSSQL\FTDATA). Here is the content of that file:

about
1
after
2
all
also
3
an
4
and
5
another
6
any
7
are
8
as
9
at
0
be
$
because
been
before
being
between
both
but
by
came
can
come
could
did
do
does
each
else
for
from
get
got
has
had
he
have
her
here
him
himself
his
how
if
in
into
is
it
its
just
like
make
many
me
might
more
most
much
must
my
never
no
now
of
on
only
or
other
our
out
over
re
said
same
see
should
since
so
some
still
such
take
than
that
the
their
them
then
there
these
they
this
those
through
to
too
under
up
use
very
want
was
way
we
well
were
what
when
where
which
while
who
will
with
would
you
your
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
+1  A: 

If you want to filter "noise" or "stop" words, I would suggest looking into regular expressions; they are lightning quick at this type of thing. As for implementation, I would probably store the noise/stop words in a table, then use the words to build your regex. You should be able to cache the regex on the server, so the performance hit should be minimal.

Here is an example based on the words you provided above. There's a good online regex tester at http://regexpal.com/:

    \b(?:about|1|after|2|all|also|3|an|4|and|5|another|6|any|7|are|8|
    as|9|at|0|be|\$|because|been|before|being|between|both|but|by|came|can|come|
    could|did|do|does|each|else|for|from|get|got|has|had|he|have|her|here|him|
    himself|his|how|if|in|into|is|it|its|just|like|make|many|me|might|more|most|
    much|must|my|never|no|now|of|on|only|or|other|our|out|over|re|said|same|see|
    should|since|so|some|still|such|take|than|that|the|their|them|then|there|these|
    they|this|those|through|to|too|under|up|use|very|want|was|way|we|well|
    were|what|when|where|which|while|who|will|with|would|you|your)\b
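
A minimal sketch of how such a regex could be built from the stored word list and cached (the module and method names are just illustrative; each word is run through Regex.Escape so entries like "$" are treated literally):

    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Public Module NoiseWordFilter

        ' Compiled regex, built once and reused for every request.
        Private _noiseRegex As Regex

        Public Sub Initialize(ByVal noiseWords As IEnumerable(Of String))
            Dim escaped As New List(Of String)
            For Each w As String In noiseWords
                ' Escape each entry so tokens like "$" are matched literally.
                escaped.Add(Regex.Escape(w))
            Next
            Dim pattern As String = "\b(?:" & String.Join("|", escaped.ToArray()) & ")\b"
            _noiseRegex = New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        End Sub

        Public Function ToKeywords(ByVal title As String) As String
            ' Strip the noise words, then join the remaining words with commas.
            Dim cleaned As String = _noiseRegex.Replace(title, " ").Trim()
            Return String.Join(", ", Regex.Split(cleaned, "\W+")).ToLower()
        End Function

    End Module

With the word list loaded once (for example in Application_Start), ToKeywords("The Wisdom of Deliberate Mistakes in Business Management") should return "wisdom, deliberate, mistakes, business, management".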
Rob
+1  A: 

These kinds of words are called "Stop Words" -- that will help you google for some implementation ideas.

My sense is that there isn't much value to doing this -- the title is already considered extremely important for search indexing. Also, is "wisdom" really that relevant a word for the article?

I think the best keywords are human-chosen, like tags, and kept to at most 1 to 3 that really describe the content.

But to answer your question -- how many do you think there will be? If I were going to do this, I'd keep them in a database (if I was already using a db), and if they impact performance, pre-load them into memory (it can be shared by all sessions), as in the sketch below.
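
A minimal sketch of that pre-loading idea, assuming a StopWords table with a Word column and a connection string named "Main" (all three names are placeholders for whatever you actually use):

    Imports System.Collections.Generic
    Imports System.Configuration
    Imports System.Data.SqlClient

    ' Loaded once on first use and shared by all sessions.
    Public Class StopWordCache

        Private Shared _words As List(Of String)
        Private Shared ReadOnly _lock As New Object()

        Public Shared Function GetWords() As List(Of String)
            If _words Is Nothing Then
                SyncLock _lock
                    If _words Is Nothing Then
                        Dim list As New List(Of String)
                        Dim cs As String = _
                            ConfigurationManager.ConnectionStrings("Main").ConnectionString
                        Using cn As New SqlConnection(cs)
                            Using cmd As New SqlCommand("SELECT Word FROM StopWords", cn)
                                cn.Open()
                                Using rdr As SqlDataReader = cmd.ExecuteReader()
                                    While rdr.Read()
                                        list.Add(rdr.GetString(0))
                                    End While
                                End Using
                            End Using
                        End Using
                        _words = list
                    End If
                End SyncLock
            End If
            Return _words
        End Function

    End Class

The cached list can then feed whichever stripping approach you pick (the regex above, or a simple word-by-word filter).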

Lou Franco
+1 Pretty much what I was going to say, but I feel the value of meta tags *at all* is dubious at best these days.
annakata
Yes, I thought about tags... but the client doesn't like the idea of a human entering tags; they want to add the "tags" automatically. The first step would be just the title, then they want something really weird like reading the body of the article and deciding which are the key words. Sounds crazy and extremely hard =P But the human tags thing is not an option =( So you all agree with the idea of storing the "stop words" in the database and not anywhere else?
emzero
In answer to your question about how many "stop words": I was just thinking of using the same ones SQL Server uses (I'm going to edit the question to include the content of the file noiseENU.txt from SQL Server 2005 FTS).
emzero
A: 

You might have a look at my post Automatic generation of META tags for ASP.NET. There I make use of noise words (or stop words) in English, French, Spanish and German. For every language I have 3 arrays: standard noise words, the most common verbs, and a third with their conjugations. This way you can remove noise words along with verbs and conjugations, even irregular verbs (in languages other than English, conjugations are much more complex than the -ed, -ing and -s endings).

The sample VB project provided creates the meta title, meta keywords and meta description for every ASP.NET page (.aspx) on the fly, without user intervention, and its CPU hit occurs only at compile time (the first request). Once the page is compiled, its tags (title, keywords, description) remain with no further CPU cost. This is because the metas are calculated and injected into the file on the fly before it is actually compiled, thanks to VirtualPathProviders (the file system is not modified at any time).

I store the words in hard-coded arrays which are kept sorted so that a binary search can be used, as in the sketch below.
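
For illustration, a tiny sketch of that sorted-array-plus-binary-search lookup (the word list here is just a small sample, not the full lists used in the post):

    Imports System

    Module NoiseLookup
        ' Must stay sorted in ordinal order for Array.BinarySearch to work.
        Private ReadOnly NoiseWords As String() = _
            New String() {"a", "an", "and", "of", "the", "to"}

        Public Function IsNoiseWord(ByVal word As String) As Boolean
            Return Array.BinarySearch(NoiseWords, word.ToLowerInvariant(), _
                                      StringComparer.Ordinal) >= 0
        End Function
    End Module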

I hope this can help you in any way. Regards.

Thanks! I'm going to take a look at your post.
emzero
A: 

@Rob's answer pointed me in the right direction for a similar task. Here's the working function I ended up with. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    Public Function StripNoiseWords(ByVal s As String) As String
        ' Load the noise word list (one word per line) and collapse the
        ' whitespace into "|" to build an alternation: about|after|all|also etc.
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|")
        ' Escape the "$" entry so it is matched literally rather than as an anchor.
        NoiseWordsRegex = NoiseWordsRegex.Replace("$", "\$")
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
Herb Caudill
A: 

Here is a solution using XQuery:

    define variable $NOISE_WORDS as xs:string*
    {
      (: \b is a word boundary. This catches beginning,
         end, and middle of string matches on whole words. :)
      ('\bthe\b', '\bof\b', '\ban\b', '\bor\b',
       '\bis\b', '\bon\b', '\bbut\b', '\ba\b')
    }

    define function remove-noise-words($string, $noise)
    {
      (: This is a recursive function. :)
      if (not(empty($noise))) then
        remove-noise-words(
          replace($string, $noise[1], '', 'i'),
          (: This passes along the noise words after
             the one just evaluated. :)
          $noise[position() > 1]
        )
      else normalize-space($string)
    }

    let $source-string1 := "The Tragedy of King Lear"
    let $source-string2 := "The Tragedy OF King Lear These an"
    let $source-string3 :=
      "The Tragedy of the an of King Lear These of"
    let $source-string4 := "The of an of"
    (: Need to handle empty result if all noise words,
       as in #4 above. :)
    let $final :=
      remove-noise-words($source-string1, $NOISE_WORDS)
    return $final

Visit http://filesharepoint.com for more details.

randi