Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.

For example, if the path is

C:\home\user\research\whitepapers\analysis\detail.txt

I'd like the user to be able to find it by querying for path:whitepapers.

Likewise, if the URI is

http://www.stackoverflow.com/questions/ask

A query containing uri:questions would retrieve it.

Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)

Suggestions welcome!

A: 

You can use StandardAnalyzer. I tested this by adding the following method to Lucene's TestStandardAnalyzer.java:

public void testBackslashes() throws Exception {
  // StandardAnalyzer splits the path on backslashes, lowercases the drive letter,
  // and keeps "detail.txt" as a single token.
  assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
      new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
  // The URI is split on "://" and "/", but the host name stays intact.
  assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
      new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}

This unit test passes with Lucene 2.9.1; you may want to try it against your specific Lucene distribution. It appears to do what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
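For reference, here is a minimal sketch of indexing and searching such a field, assuming the Lucene 2.9.x Java API. The class name, the RAMDirectory, and the "path" field setup are illustrative only, not part of your code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PathFieldSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

    // Index the path as an analyzed field so StandardAnalyzer splits it into segments.
    IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("path",
        "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Query a single path segment, using the same analyzer at query time.
    QueryParser parser = new QueryParser(Version.LUCENE_29, "path", analyzer);
    IndexSearcher searcher = new IndexSearcher(dir, true);
    TopDocs hits = searcher.search(parser.parse("path:whitepapers"), 10);
    System.out.println("hits: " + hits.totalHits); // expect 1
    searcher.close();
  }
}

The key points are that the field is indexed as ANALYZED (a NOT_ANALYZED field would only match the full path) and that the same analyzer is used for indexing and querying.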

Yuval F
Thanks! Using the StandardAnalyzer to index path segments also works in Lucene.Net 2.4.0.
dthrasher
Do you know of an out-of-the-box Lucene Analyzer that would break the domain name apart at the "dots" or separate the filename from its extension?
dthrasher
Maybe you can use LetterTokenizer (http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/LetterTokenizer.html) chained with some filter; LetterTokenizer divides text at non-letters. See the sketch below these comments.
Yuval F
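
A minimal sketch of the LetterTokenizer suggestion above, assuming the Lucene 2.9 analysis API (the class name is hypothetical). It splits at every non-letter character, so "www.stackoverflow.com" becomes [www, stackoverflow, com] and "detail.txt" becomes [detail, txt], but be aware that digits are dropped too:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical analyzer: LetterTokenizer splits at non-letters,
// LowerCaseFilter normalizes case to match lowercase query terms.
public class LetterSegmentAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new LetterTokenizer(reader));
  }
}

If lowercasing is the only filtering you need, LowerCaseTokenizer combines both steps in a single tokenizer.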