How to build a simple inverted index?

Munichong · Sep 20, 2012 · Viewed 22.7k times

I want to build a simple indexing function for a search engine without using any library such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.

Now, I have several questions:

  1. What kind of data structure is often used for building an inverted index? A multidimensional list?

  2. After building the index, how do I write it into files? What format should the file have? Something like a table, like drawing an index table on paper?

Answer

Felipe Hummel · Sep 20, 2012

You can see a very simple implementation of an inverted index and search in TinySearchEngine.

For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:

val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]

or a Java-esque:

HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();

The hash map associates each term/word/token with a list of Postings. A Posting is just an object that represents one occurrence of a word inside a document:

case class Posting(docId:Int, var termFrequency:Int)

Indexing a new document is just a matter of tokenizing it (splitting it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase its termFrequency instead. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert each Posting once with the correct termFrequency instead of updating it every time.
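A minimal sketch of that indexing step (the regex tokenizer and the indexDocument name are my own, just for illustration):

import scala.collection.mutable

val invertedIndex = mutable.HashMap[String, List[Posting]]()

// Naive tokenizer: lowercase and split on anything that isn't a letter.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").toSeq.filter(_.nonEmpty)

def indexDocument(docId: Int, text: String): Unit =
  for (token <- tokenize(text)) {
    val postings = invertedIndex.getOrElse(token, Nil)
    postings.find(_.docId == docId) match {
      case Some(p) => p.termFrequency += 1    // already seen in this doc: bump frequency
      case None    => invertedIndex(token) = Posting(docId, 1) :: postings
    }
  }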

Regarding your second question, there are normally two cases:

(1) you have an (almost) immutable index. You index all your data once, and if you get new data you can just reindex. There is no need for real-time indexing or for indexing many times an hour, for example.

(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.

For case (1), you can have at least 2 files:

1 - The inverted index file. For each term, it lists all its Postings (docId/termFrequency pairs). It's represented here in plain text, but normally stored as binary data.

 Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
 Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
 Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
 Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
 ...
 TermN<docId5,termFreq><docId7,termFreq>

2 - The offset file. It stores, for each term, the offset at which its inverted list starts in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup time. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file. A sketch of writing and reading these two files follows the offset listing.

Term1 -> 0
Term2 -> 126
Term3 -> 222
....
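Here's how those two files could be produced and used, assuming the fixed-width binary layout sketched above (4-byte ints; writeIndex and readPostings are illustrative names, not a standard API):

import java.io.{DataOutputStream, FileOutputStream, RandomAccessFile}

// Write each term's postings list as binary data, recording the byte
// offset where each list starts. The returned map is the offset file's
// content; in practice you'd also persist it to disk.
def writeIndex(index: Map[String, List[Posting]], path: String): Map[String, Long] = {
  val out = new DataOutputStream(new FileOutputStream(path))
  var offset = 0L
  val offsets = index.map { case (term, postings) =>
    val termOffset = offset
    out.writeInt(postings.size)                  // list length header
    for (p <- postings) { out.writeInt(p.docId); out.writeInt(p.termFrequency) }
    offset += 4 + 8L * postings.size             // 4-byte header + 8 bytes per posting
    term -> termOffset
  }
  out.close()
  offsets
}

// At query time: seek straight to the term's offset and read its list.
def readPostings(path: String, offset: Long): List[Posting] = {
  val in = new RandomAccessFile(path, "r")
  in.seek(offset)
  val n = in.readInt()
  val result = List.fill(n)(Posting(in.readInt(), in.readInt()))
  in.close()
  result
}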

Along with these two files you can (and generally will) have file(s) to store each term's IDF and each document's norm.

For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) does it.

The file format can be the same as explained above. The main difference is that when you index new documents, systems like Lucene don't rebuild the index from scratch; they just create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.

To perform a query over this split index, you can run the query against each individual index (in parallel) and merge the results together before returning them to the user.
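A sketch of that fan-out and merge, assuming a hypothetical Segment trait whose search returns (docId, score) pairs. Since every document lives in exactly one segment, merging is just a concatenation followed by a global re-sort by score:

trait Segment {
  def search(query: String): Seq[(Int, Double)]   // (docId, score) pairs
}

// Run the query against every segment (this is the part you'd run in
// parallel) and merge the hits before returning them to the user.
def searchAll(segments: Seq[Segment], query: String): Seq[(Int, Double)] =
  segments.flatMap(_.search(query)).sortBy { case (_, score) => -score }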

Lucene calls these "little" indexes segments.

The obvious concern here is that you'll accumulate a lot of little segments very quickly. To avoid this, you'll need a policy for merging segments into larger ones. For example, if you have more than N segments, you could decide to merge all segments smaller than 10 KB together.
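A toy version of such a policy (sizeOf and mergeAll are hypothetical helpers passed in by the caller; real merge policies, like Lucene's, are considerably more sophisticated):

// If there are too many segments, collapse all the small ones into one.
def maybeMerge(segments: List[Segment], maxSegments: Int, thresholdBytes: Long,
               sizeOf: Segment => Long,
               mergeAll: List[Segment] => Segment): List[Segment] =
  if (segments.size <= maxSegments) segments
  else {
    val (small, large) = segments.partition(s => sizeOf(s) < thresholdBytes)
    if (small.size < 2) segments else mergeAll(small) :: large
  }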