I am using Lucene.Net 2.0 to index some fields from a database table. One of the fields is a 'Name' field which allows special characters. When I perform a search, it does not find my document that contains a term with special characters.

I index my field as such:

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

Directory DALDirectory = FSDirectory.GetDirectory(@"C:\Indexes\Name", false);
Analyzer analyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED));
indexWriter.AddDocument(doc);

indexWriter.Optimize();
indexWriter.Close();

And I search doing the following:

// Normalize the input and escape Lucene's special query characters.
value = value.Trim().ToLower();
value = QueryParser.Escape(value);

// Build an exact term query and collect every match in the index.
Query searchQuery = new TermQuery(new Term(field, value));
Searcher searcher = new IndexSearcher(DALDirectory);

TopDocCollector collector = new TopDocCollector(searcher.MaxDoc());
searcher.Search(searchQuery, collector);
ScoreDoc[] hits = collector.TopDocs().scoreDocs;

If I search the 'Name' field for the value 'Test', it finds the document. If I perform the same search with the value 'Test (Test)', it does not find the document.

Even more strangely, if I remove the QueryParser.Escape line and search for a GUID (which, of course, contains hyphens), it finds documents where the GUID value matches, but performing the same search with the value 'Test (Test)' still yields no results.

I am unsure what I am doing wrong. I am using the QueryParser.Escape method to escape the special characters, and I am storing the field and searching it according to Lucene.Net's examples.

Any thoughts?

+2  A: 

StandardAnalyzer strips out the special characters during indexing. You can pass in an explicit list of stopwords (excluding the ones you want to keep).
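
A quick way to see this is to dump the token stream (a minimal sketch; the Console output is only for illustration):

Analyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.TokenStream("Name", new System.IO.StringReader("Test (Test)"));
Token token;
while ((token = stream.Next()) != null)
{
    // Prints "test" twice -- the parentheses never reach the index.
    System.Console.WriteLine(token.TermText());
}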

Mikos
Should I consider using another Analyzer to achieve my goal? What about switching from TOKENIZED to UN_TOKENIZED when storing fields with special characters?
Brandon
Well, if you don't tokenize the field you cannot "search" on it. You have a couple of choices: write your own analyzer (it's very simple) or pass the list of stop words to StandardAnalyzer. Something like:

Hashtable htStopwords = new Hashtable();
Analyzer analyzer = new StandardAnalyzer(htStopwords);
Mikos
You can also look at StopAnalyzer or SimpleAnalyzer; they might help. The problem is that you could end up with a lot of noise words. But if that is not an issue...
Mikos
A: 

While indexing, you tokenized the field, so your input string produces two tokens: "test" and "test". For the search, you are constructing the query by hand, i.e. using TermQuery instead of QueryParser, which would have tokenized the search text.
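
A sketch of the difference (the literal values are illustrative):

// Hand-built term: looks for the single literal term "test (test)",
// which was never indexed, so nothing matches.
Query termQuery = new TermQuery(new Term("Name", "test (test)"));

// QueryParser runs the text through the analyzer first; the parentheses
// are stripped, and the resulting query (Name:test Name:test) does
// match the tokenized field.
QueryParser parser = new QueryParser("Name", new StandardAnalyzer());
Query parsedQuery = parser.Parse(QueryParser.Escape("Test (Test)"));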

For a match on the entire value, you need to index the field UN_TOKENIZED. Then the input string is kept as a single token, "Test (Test)", and your current search code will work. You have to watch the case of the input string carefully: if you index lower-case text, you have to lower-case the search value as well.
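
A minimal sketch of that change, reusing the names from the question:

// Index the value as a single, unanalyzed token.
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.UN_TOKENIZED));

// The indexed token is "Test (Test)" verbatim, so the TermQuery must
// match it exactly -- including case.
Query searchQuery = new TermQuery(new Term("Name", "Test (Test)"));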

It is generally good practice to use the same analyzer during indexing and searching. You can use KeywordAnalyzer to generate a single token from the input string.
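
For instance, a sketch of indexing and searching with KeywordAnalyzer (reusing DALDirectory from the question):

// KeywordAnalyzer emits the entire field value as one token, so a
// TOKENIZED field behaves like UN_TOKENIZED here.
Analyzer analyzer = new KeywordAnalyzer();
IndexWriter writer = new IndexWriter(DALDirectory, analyzer, true);

Document doc = new Document();
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);
writer.Close();

// Search with the exact same string (no lower-casing).
Query query = new TermQuery(new Term("Name", "Test (Test)"));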

Shashikant Kore