Hi there,

I'm new to Lucene. I started learning the version 3 branch, and there's one thing I don't understand (obviously because I'm not experienced in the subject).

In Lucene 2.9, if I wanted a list of tokens I would create an ArrayList of the Token class, ArrayList<Token> for example. That's pretty intuitive to me, and the concept of a token is very clear.

Now that use of the Token class is discouraged in favour of the Attribute-based API, do I have to create my own class to encapsulate the attributes I want? If so, isn't that almost recreating Lucene's Token class?

I'm writing a class to test analyzers, and I guess having a list of the resulting tokens would make testing easier.

Any help would be appreciated ;) Thank you!

A: 

I think you can do something like this:

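TokenStream tkst = analyzer.tokenStream("field", new StringReader("text")); // tokenStream takes a java.io.Reader
// Attributes are normally registered by interface (e.g. TermAttribute.class), so this
// lookup may throw IllegalArgumentException if the stream has no Token attribute.
Token token = tkst.getAttribute(Token.class);
while (tkst.incrementToken()) {
    // Do something with token.
}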

The proper documentation is in the analysis package: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/package-summary.html

smmv
+1  A: 

Use the TermAttribute class:

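// tokenStream expects a java.io.Reader, so wrap the text in a StringReader.
TokenStream stream = analyzer.tokenStream("field", new StringReader("text"));
TermAttribute termAttr = stream.getAttribute(TermAttribute.class);
while (stream.incrementToken()) {
    String token = termAttr.term(); // text of the current token
}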
larsmans
Thanks for answering, but it doesn't answer my question. I know how to get attributes from a TokenStream. In your code you're only getting the TermAttribute, so you can save each term in a String[] and there's your list of tokens. But if you also want an OffsetAttribute, then you have two attributes and can't save them both in a String[], and that is what my question is about. The Token class encapsulates various attributes in a single structure, and I need to know: since Lucene 3 discourages the use of Token, what is the recommended way to encapsulate various attributes in the same structure?
Fabio
Apparently there isn't any, at least not that I know of. I've been surprised by this decision as well. The Lucene developers apparently favor optimization over proper API design.
larsmans
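For the case raised in the comments above (keeping both the term and its offsets for every token), one possible approach is a small holder class that copies the attribute values on each incrementToken() call. Copying matters because a TokenStream reuses the same attribute instances for every token, so storing the attribute objects themselves would leave you with a list that only reflects the last token. The sketch below is plain Java and runs without Lucene; TokenSnapshot and TokenSnapshotDemo are hypothetical names, not Lucene classes, and the Lucene loop is shown only as a comment under the assumption that the stream provides TermAttribute and OffsetAttribute.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class (not part of Lucene): an immutable copy of the
// attribute values for a single token.
final class TokenSnapshot {
    private final String term;
    private final int startOffset;
    private final int endOffset;

    TokenSnapshot(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String term() { return term; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    @Override
    public String toString() {
        return term + "[" + startOffset + "," + endOffset + ")";
    }
}

class TokenSnapshotDemo {
    public static void main(String[] args) {
        List<TokenSnapshot> tokens = new ArrayList<TokenSnapshot>();

        // With Lucene 3.x on the classpath, the list could be filled roughly like this
        // (assumption: the analyzer's stream supports both attributes):
        //   TermAttribute termAttr = stream.addAttribute(TermAttribute.class);
        //   OffsetAttribute offsetAttr = stream.addAttribute(OffsetAttribute.class);
        //   while (stream.incrementToken()) {
        //       tokens.add(new TokenSnapshot(termAttr.term(),
        //               offsetAttr.startOffset(), offsetAttr.endOffset()));
        //   }
        // The values are hard-coded here so the sketch runs without Lucene.
        tokens.add(new TokenSnapshot("hello", 0, 5));
        tokens.add(new TokenSnapshot("world", 6, 11));

        for (TokenSnapshot t : tokens) {
            System.out.println(t);
        }
    }
}
```

In an analyzer test, a List<TokenSnapshot> built this way can be compared against the expected terms and offsets one entry at a time. Whether this is preferable to reusing Token is a judgment call; it mainly helps when only a few attributes are of interest.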
+1  A: 

According to the Token Javadoc, "Even though it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API."

I suggest you keep using a Token. It matches the description above.

Yuval F
Thanks, I was misunderstanding the notes about the Token class ;)
Fabio