I am reading about SOLR and indexing a MySQL database into SOLR.

What do they mean by "tokenize" and "un-tokenize"?

And what does it mean when fields are "normalized"?

I know how and what it means to normalize a database, but a field? How can a simple field be normalized?

Thanks

+1  A: 

The tokenizer splits a character stream into tokens (usually words), which are the atomic units of search. Strings can be split on whitespace, word boundaries, etc. In a second stage, these tokens are often passed through filters that apply additional transformations (Soundex codes, Porter stemming, etc.). The result is a normalized representation of the words that can be compared efficiently.

For example, "The Cats Eat Cheese!" might be normalized to the words: 1) cat 2) eat 3) cheese

"The" was removed (stopword), "cats" became singular (stemming), the punctuation is gone, and the words are lowercased.
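
To make the two stages concrete, here is a rough sketch in plain Python of a tokenizer followed by a few filters. It is not Solr's implementation: the stopword list, the `naive_stem` helper and the `analyze` function are made up for illustration, and the "stemmer" only strips a trailing "s", where a real Porter stemmer does far more.

```python
import re

# Tiny illustrative stopword list (an assumption, not Solr's default list).
STOPWORDS = {"the", "a", "an"}

def naive_stem(word):
    # Toy stand-in for a real stemmer: drop a single trailing "s".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def analyze(text):
    # Stage 1: tokenize on runs of letters/digits, which also drops punctuation.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Stage 2: filters applied in order: lowercase, stopword removal, stemming.
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]

print(analyze("The Cats Eat Cheese!"))  # ['cat', 'eat', 'cheese']
```

The same `analyze` step would be applied to the search query, so that a search for "Cats" and a document containing "cat" end up comparing the same normalized term.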

jspcal
+1  A: 

What do they mean by "tokenize" and "un-tokenize"?

Tokenizing a field enables full-text search, i.e. finding any word that occurs anywhere in the field. An untokenized field will be found only on a complete and exact match, e.g. if the field's content is "blue moon" then it will only be found when you search for "blue moon", not when you search only for "blue".
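
A small Python sketch of the difference, under simplifying assumptions: the two helper functions are hypothetical, tokenizing is just lowercasing and splitting on whitespace, and a multi-word query is assumed to require every term (real query parsers can be configured differently).

```python
def matches_untokenized(field_value, query):
    # The whole field is one term, so only an exact match counts.
    return field_value == query

def matches_tokenized(field_value, query):
    # The field is split into terms; each query term must be one of them.
    field_terms = set(field_value.lower().split())
    return all(term in field_terms for term in query.lower().split())

print(matches_untokenized("blue moon", "blue moon"))  # True
print(matches_untokenized("blue moon", "blue"))       # False
print(matches_tokenized("blue moon", "blue"))         # True
```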

And what does it mean when fields are "normalized"?

This most likely refers to Unicode normalization. Unicode has separate code points for combining diacritics, e.g. U+0300 is the combining grave accent, so the accented letter è can be either a single code point (U+00E8) or a sequence of two (U+0065 followed by U+0300). But of course you want both to be found when you search for è.
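
As an illustration outside of Solr, Python's standard `unicodedata` module shows the two forms of è and how normalizing both sides makes them compare equal. The variable names are just for this example; the decomposed form uses the letter e (U+0065) followed by the combining grave accent (U+0300).

```python
import unicodedata

precomposed = "\u00e8"   # è as a single code point
decomposed = "e\u0300"   # e followed by a combining grave accent

# They render the same but are different strings.
print(precomposed == decomposed)  # False

# Normalizing to a common form (NFC composes, NFD decomposes) makes them match.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

An index that applies the same normalization form to both the stored text and the query would find either spelling.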

Michael Borgwardt