ansaurus

Question

Correct way to write a Tokenizer in Lucene

Answer 1

+2 A:

Generally, when indexing documents that contain HTML markup with Lucene, you should first parse out the HTML into a textual representation with the parts you want to leave, and only then feed it to the Tokenizer to be indexed.

See jGuru: How can I index HTML documents? for an FAQ explaining more of how to do this.

Avi 2009-12-01 15:47:36

Answer 2

+1 A:

This is most easily achieved by pre processing the text before giving it to lucene to tokenize. Use an html parser, like Jericho to convert your content into text with no html by stripping out tags you dont care about, and extracting the text from those that you do. Jericho's TextExtractor is perfect for this, and easy to use.

String text = "Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi"
 +"Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete"
 +"scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.";

TextExtractor te = new TextExtractor(new Source(text)){
 @Override
 public boolean excludeElement(StartTag startTag) {
  return startTag.getName() != HTMLElementName.A;
 }
};
System.out.println(te.toString());

This outputs:

Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativiLinux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potetescaricare a questo link.

You could use a custom Lucene Tokenizer with an html Filter, but it's not the easiest solution - using Jericho will defn save you development time for this task. The existing html analysers for lucene probably don't want to do exactly what you want, as they will keep all text on the page. The only caveat to this is that you will end up processing the text twice, rather than all as one stream, but unless you are handling Terabytes of data you aint gonna care about this performance consideration, and dealing with performance is something best left untill you have your app fleshed out and have identified it as an issue anyway.

Joel 2009-12-01 15:57:42

ansaurus

tags:

views:

answers:

Correct way to write a Tokenizer in Lucene

related questions