I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process of obtaining Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question. Somehow, an AttributeSource is supposed to be used rather than Tokens, and I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    // the attribute instances are reused; they are refilled on each call to incrementToken()
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
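
This assumes an analyzer and a reader are already in scope. If not, a minimal setup might look like the following (the field name, the sample text, and the choice of StandardAnalyzer are just illustrative, and older Lucene releases require a Version argument in the StandardAnalyzer constructor):

Analyzer analyzer = new StandardAnalyzer();  // e.g. new StandardAnalyzer(Version.LUCENE_36) on older releases
Reader reader = new StringReader("The quick brown fox jumps over the lazy dog");
TokenStream tokenStream = analyzer.tokenStream("content", reader);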
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset();  // required before incrementToken() in recent Lucene versions
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
tokenStream.end();    // finalizes offsets/state once the stream is exhausted
tokenStream.close();  // releases resources so the analyzer can hand out a new stream
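
Putting it all together, here is a self-contained sketch for a recent Lucene version; the class name, the tokenize helper, and the "content" field name are placeholders, and your analyzer choice may differ:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeExample {

    // Collects the terms an analyzer produces for a given piece of text.
    static List<String> tokenize(Analyzer analyzer, String fieldName, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();                      // must be called before the first incrementToken()
            while (tokenStream.incrementToken()) {
                terms.add(termAttribute.toString());  // copy the term; the attribute object is reused
            }
            tokenStream.end();                        // finish consuming the stream
        }                                             // close() happens via try-with-resources
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            System.out.println(tokenize(analyzer, "content", "The quick brown fox"));
            // prints something like [the, quick, brown, fox] (stop-word handling depends on the Lucene version)
        }
    }
}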