Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.

For example, if the path is

C:\home\user\research\whitepapers\analysis\detail.txt

I'd like the user to be able to find it by querying for path:whitepapers.

Likewise, if the URI is

http://www.stackoverflow.com/questions/ask

A query containing uri:questions would retrieve it.

Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)

Suggestions welcome!

A: 

You can use StandardAnalyzer. I tested this by adding the following method to Lucene's TestStandardAnalyzer.java:

public void testBackslashes() throws Exception {
  // StandardAnalyzer splits the path on backslashes, lowercases the drive letter,
  // and keeps "detail.txt" as a single token.
  assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
      new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
  // The URI is split on "://" and "/", but the host name stays intact.
  assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
      new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}

This unit test passes with Lucene 2.9.1; you may want to try it against your specific Lucene distribution. It appears to do what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
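For reference, here is a minimal sketch of indexing and searching such a field, assuming the Lucene 2.9.x Java API. The class name, the RAMDirectory, and the "path" field setup are illustrative only, not part of your code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PathFieldSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

    // Index the path as an analyzed field so StandardAnalyzer splits it into segments.
    IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("path",
        "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Query a single path segment, using the same analyzer at query time.
    QueryParser parser = new QueryParser(Version.LUCENE_29, "path", analyzer);
    IndexSearcher searcher = new IndexSearcher(dir, true);
    TopDocs hits = searcher.search(parser.parse("path:whitepapers"), 10);
    System.out.println("hits: " + hits.totalHits); // expect 1
    searcher.close();
  }
}

The key points are that the field is indexed as ANALYZED (a NOT_ANALYZED field would only match the full path) and that the same analyzer is used for indexing and querying.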

Yuval F
Thanks! Using the StandardAnalyzer to index path segments also works in Lucene.Net 2.4.0.
dthrasher
Do you know of an out-of-the-box Lucene Analyzer that would break the domain name apart at the "dots" or separate the filename from its extension?
dthrasher
Maybe you can use LetterTokenizer (http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/LetterTokenizer.html) chained with some filter; LetterTokenizer divides text at non-letters. See the sketch below these comments.
Yuval F
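
A minimal sketch of the LetterTokenizer suggestion above, assuming the Lucene 2.9 analysis API (the class name is hypothetical). It splits at every non-letter character, so "www.stackoverflow.com" becomes [www, stackoverflow, com] and "detail.txt" becomes [detail, txt], but be aware that digits are dropped too:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical analyzer: LetterTokenizer splits at non-letters,
// LowerCaseFilter normalizes case to match lowercase query terms.
public class LetterSegmentAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new LetterTokenizer(reader));
  }
}

If lowercasing is the only filtering you need, LowerCaseTokenizer combines both steps in a single tokenizer.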