ansaurus

Question

Example using WikipediaTokenizer in Lucene

Answer 1

A:

// Pre-3.0 Lucene API: next(Token) returns null at the end of the stream
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
Token token = new Token();
token = tf.next(token);

http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

Regards

Raul Lapeira Herrero 2010-10-13 14:25:54

Answer 2

+3 A:

In Lucene 3.0, the next() method was removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To read each token, use the methods of the AttributeSource class. Depending on the attributes you want (term, type, payload, etc.), register the corresponding attribute class with your tokenizer using addAttribute(), and read the current token's values from the returned attribute after each incrementToken() call.

The following partial code sample is from the WikipediaTokenizer test class, which you can find by downloading the Lucene source code.

...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + typeAtt.type());
  // tcm is defined in the elided part of the test; it is looked up by token text to get the expected type
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType));
  count++;
  // Tally the token types this test cares about
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)) {
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)) {
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)) {
    numCategory++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)) {
    numCitation++;
  }
}
...
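
If the attribute pattern above looks odd, here is a minimal, stdlib-only sketch of the idea that runs without Lucene on the classpath. Everything in it (ToyTokenStream, ToyTermAttribute, ToyTypeAttribute, and the "word"/"italics" type strings) is hypothetical and made up for illustration; it is not Lucene's implementation and does not reproduce WikipediaTokenizer's real rules. It only shows why you fetch the attributes once, before the loop, and then re-read them after every incrementToken() call.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for TermAttribute/TypeAttribute: mutable holders
// that the stream overwrites in place for every token.
final class ToyTermAttribute { String term = ""; }
final class ToyTypeAttribute { String type = ""; }

// Hypothetical toy stream, NOT a Lucene class. It splits on whitespace and
// labels a token wrapped in '' ... '' as "italics", anything else as "word".
final class ToyTokenStream {
    private final String[] words;
    private int pos = 0;
    private final ToyTermAttribute termAtt = new ToyTermAttribute();
    private final ToyTypeAttribute typeAtt = new ToyTypeAttribute();
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    ToyTokenStream(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        attributes.put(ToyTermAttribute.class, termAtt);
        attributes.put(ToyTypeAttribute.class, typeAtt);
    }

    // Returns the single shared instance for the requested attribute class.
    <T> T addAttribute(Class<T> cls) {
        Object att = attributes.get(cls);
        if (att == null) {
            throw new IllegalArgumentException("Unknown attribute: " + cls);
        }
        return cls.cast(att);
    }

    // Advances to the next token and updates the shared attributes;
    // returns false once the input is exhausted.
    boolean incrementToken() {
        if (pos >= words.length) {
            return false;
        }
        String w = words[pos++];
        boolean italic = w.length() > 4 && w.startsWith("''") && w.endsWith("''");
        termAtt.term = italic ? w.substring(2, w.length() - 2) : w;
        typeAtt.type = italic ? "italics" : "word";
        return true;
    }
}

final class AttributeLoopSketch {
    // Counts tokens per type using the same loop shape as the test above.
    static Map<String, Integer> countByType(String text) {
        ToyTokenStream ts = new ToyTokenStream(text);
        // Fetch the attributes once, before iterating.
        ToyTermAttribute termAtt = ts.addAttribute(ToyTermAttribute.class);
        ToyTypeAttribute typeAtt = ts.addAttribute(ToyTypeAttribute.class);
        Map<String, Integer> counts = new LinkedHashMap<>();
        while (ts.incrementToken()) {
            // The same two objects now describe the current token.
            System.out.println("Text: " + termAtt.term + " Type: " + typeAtt.type);
            counts.merge(typeAtt.type, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByType("plain ''slanted'' text"));
    }
}
```

With the real tokenizer the loop has the same shape: swap ToyTokenStream for WikipediaTokenizer, use TermAttribute.class and TypeAttribute.class, and compare typeAtt.type() against the WikipediaTokenizer constants as in the test code above.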

hakan 2010-10-13 15:07:37
