ansaurus

Question

What analyzer should I use for a URL in lucene.net?

Answer 1

+1 A:

You should parse the URL yourself (I imagine there's at least one .NET class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't analyze them at all.

Jonathan Feinberg 2009-12-03 17:09:30
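
As a rough illustration of the parse-it-yourself approach from the answer above, here is a minimal Java sketch that uses java.net.URI to pull a URL apart. The class name UrlFields and the field names (url_scheme, url_host and so on) are made up for this example and are not part of Lucene. In .NET, System.Uri plays a similar role. Each name/value pair would then be added to the document as a field that is indexed but not analyzed.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: splits a URL into the pieces
// you might index as separate, un-analyzed (keyword) fields.
class UrlFields {

    // Returns field name -> value. The field names are illustrative assumptions.
    static Map<String, String> parse(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        Map<String, String> fields = new LinkedHashMap<String, String>();
        if (uri.getScheme() != null) {
            fields.put("url_scheme", uri.getScheme().toLowerCase(Locale.ROOT));
        }
        if (uri.getHost() != null) {
            // Host names are case-insensitive, so normalise them for exact matching.
            fields.put("url_host", uri.getHost().toLowerCase(Locale.ROOT));
        }
        if (uri.getPath() != null && !uri.getPath().isEmpty()) {
            fields.put("url_path", uri.getPath());
        }
        if (uri.getQuery() != null) {
            fields.put("url_query", uri.getQuery());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("http://www.Example.com/docs/index.html?page=2");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // In Lucene, each pair would become one keyword-style field on the document.
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Because the values are not analyzed, a search on url_host only matches the exact host string, which is usually what you want when filtering.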

Answer 2

+2 A:

The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a MyUrlTokenizer that extends or modifies the code in StandardTokenizer to tokenize URLs. Something like:

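public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // MyUrlTokenizer is the custom tokenizer you write yourself
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        // Shorthand only: the real StopFilter and SynonymFilter constructors take
        // further arguments (such as the stop word set); check your Lucene version
        result = new StopFilter(result);
        result = new SynonymFilter(result);

        return result;
    }

}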

Where MyUrlTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.

Note that if you have a distinct fieldName for URLs then you can modify the above code to use MyUrlTokenizer for that field and the StandardTokenizer for everything else.

e.g.

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if ("url".equals(fieldName)) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    // wrap result in any filters (LowerCaseFilter etc.) here, as above
    return result;
}

Joel 2009-12-03 17:13:15

I know this is Java, but the same principle should apply, in theory, to .NET.

Joel 2009-12-03 17:18:30

Thanks for the information. I've looked at the StandardTokenizer and I really don't understand half of it! I don't need or want all of the code handed to me on a plate, but a nudge in the right direction on how to create a custom tokenizer based on those stop characters would be amazing. Thanks.

John_ 2009-12-07 09:38:03

You can probably just copy it and edit it to add the additional tokens you need. BTW, I should have mentioned: if any of your analyzers are doing any expensive initialisation (like huge lists of stop words) you should use the reusableTokenStream method.

Joel 2009-12-07 14:52:27

Thanks Joel. I ended up creating a tokenizer which inherited from CharTokenizer as this seemed simpler and did what I required.

John_ 2009-12-07 16:24:46
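
For anyone wondering what the CharTokenizer route mentioned above boils down to: as far as I know, a CharTokenizer subclass in the Java API of that era only has to override isTokenChar(char), and the base class emits each maximal run of accepted characters as one token. The sketch below reproduces that logic in plain Java with no Lucene classes, so it is an approximation of the idea rather than the actual tokenizer. The class name UrlCharSplitter and the separator set are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the CharTokenizer idea: a token is a maximal run of
// characters for which isTokenChar returns true. No Lucene classes are used.
class UrlCharSplitter {

    // Assumed separator set; extend it to suit your URLs.
    private static final String SEPARATORS = "/-_.:?&=#";

    // Mirrors the per-character decision a CharTokenizer subclass makes.
    static boolean isTokenChar(char c) {
        return !Character.isWhitespace(c) && SEPARATORS.indexOf(c) < 0;
    }

    // Splits text into lower-cased tokens, dropping separators and whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("http://www.example.com/my-page_1"));
    }
}
```

In a real Lucene or Lucene.NET tokenizer only the isTokenChar decision would be yours; the buffering loop is what the base class already does for you.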
