Hi,

I intend to use the n-gram part of this code:

http://www.codeproject.com/KB/cs/tfidf.aspx

The algorithm produces these tri-gram results:

t th the he e q qu qui uic ick ck k r re red ed d

for:

the quick red

However, this source:

http://en.wikipedia.org/wiki/Trigram

reckons it should be:

the qui k_r he_ uic _re e_q ick red _qu ck_

(space indicated by ‘_’).

Which is correct? Are there any other C# implementations out there?

Thanks.

Best wishes,

Christian

A: 

The second example is correct.
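
As a rough illustration, here is a minimal Python sketch of plain character trigrams: a window of three characters slides over the whole string one position at a time, with no padding and no word splitting. The function name `char_trigrams` and the `space_marker` parameter are made up for this example; the same loop should port to C# with `Substring(i, 3)`.

```python
def char_trigrams(text, n=3, space_marker="_"):
    """Return every run of n consecutive characters in text, in order."""
    # Show spaces as a visible marker, following the '_' convention
    # used in the question (an illustrative choice, not a requirement).
    marked = text.replace(" ", space_marker)
    # Slide a window of width n one character at a time; no padding is added,
    # so a string shorter than n yields an empty list.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(" ".join(char_trigrams("the quick red")))
# prints: the he_ e_q _qu qui uic ick ck_ k_r _re red
```

Trigrams such as `e_q` and `k_r` span the space between words, which is exactly what splitting into words first would discard.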

P.S. Why do you generate trigrams for the complete text rather than only for individual words? What is your use case?

Skarab
I believe this is useful for 'words' that actually consist of two strings (i.e. are separated by a space). That information would be lost if I applied a word breaker first.
csetzkorn
The second output is correct.
Skarab