I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).
I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?
On the above link, it says Lucene uses this algorithm for indexing:
- incremental algorithm:
- maintain a stack of segment indices
- create index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=8
for (size = 1; size < M; size *= b) {
if (there are b indexes with size docs on top of the stack) {
pop them off the stack;
merge them into a single index;
push the merged index onto the stack;
} else {
break;
}
}
How does this algorithm provide optimized indexing?
Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?
There's a fairly good article here: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/
Edit 12/2014: Updated to an archived version due to the original being deleted, probably the best more recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html
There's an even more recent version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description, but it seems to have less information in it than the older one.
In a nutshell, when lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.
Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the english language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/
When a query is issued it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.