How do I implement tag searching? With Lucene?

user34537 · Mar 13, 2010 · Viewed 7.2k times

I haven't used Lucene. The last time I asked (many months ago, maybe a year), people suggested Lucene. If I shouldn't use Lucene, what should I use? As an example, say there are items tagged like this:

  1. apples carrots
  2. apples
  3. carrots
  4. apple banana

If a user searches apples, I don't care about the relative order of 1, 2 and 4. However, something I have seen on many forums and HATED is that when a user searches apple carrots, 2 and 3 rank high while 1 is hard to find, even though it matches my search more closely.

Also, I would like to be able to search carrots -apples, which should return only 3. I am not sure what should happen if I search carrots banana, but as long as items tagged like 2 and 3 rank lower than 1 when I search apples carrots, I'll be happy.

Can Lucene do this, and where do I start? When I look it up I see a lot of classes, and tutorials that talk about documents and web pages, but none are clear about what to do when I want to tag something. If not Lucene, what should I use for tagging?

Answer

Yuval F · Mar 14, 2010

Edit: You can use Lucene. Here is an explanation of how to do this in Lucene.net. Some Lucene basics:

  • Document - the storage unit in Lucene. It is somewhat analogous to a database record.
  • Field - the search unit in Lucene, analogous to a database column. Lucene searches for text by taking a query and matching it against fields. A field must be indexed to be searchable.
  • Token - the search atom in Lucene. Usually a word, sometimes a phrase, letter or digit.
  • Analyzer - the part of Lucene that transforms a field's text into tokens.
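To make the four terms above concrete, here is a minimal conceptual sketch in plain C#. It does not use Lucene.net at all: TagIndexSketch, Analyze and BuildIndex are hypothetical names invented for illustration, and the whitespace-and-lowercase "analyzer" is an assumption, not what any particular Lucene analyzer does. It only shows the idea of turning a "tags" field into tokens and mapping each token back to the documents that contain it.

```csharp
using System;
using System.Collections.Generic;

// Conceptual model only: none of these types are Lucene classes.
static class TagIndexSketch
{
    // Plays the role of an Analyzer: lowercases the field text
    // and splits it on spaces into tokens.
    public static string[] Analyze(string fieldText)
    {
        return fieldText.ToLowerInvariant()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Builds a token -> document ids map for a single field (here, "tags").
    public static Dictionary<string, SortedSet<int>> BuildIndex(
        Dictionary<int, string> tagsByDocId)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        foreach (var entry in tagsByDocId)
        {
            foreach (string token in Analyze(entry.Value))
            {
                if (!index.TryGetValue(token, out SortedSet<int> ids))
                {
                    ids = new SortedSet<int>();
                    index[token] = ids;
                }
                ids.Add(entry.Key);
            }
        }
        return index;
    }

    static void Main()
    {
        // One "document" per tagged item from the question; the value is its "tags" field.
        var docs = new Dictionary<int, string>
        {
            { 1, "apples carrots" },
            { 2, "apples" },
            { 3, "carrots" },
            { 4, "apple banana" },
        };

        foreach (var entry in BuildIndex(docs))
        {
            Console.WriteLine(entry.Key + " -> " + string.Join(", ", entry.Value));
        }
    }
}
```

Note that with this simple splitting, "apple" and "apples" are different tokens, so item 4 would not be found by a search for apples unless the analyzer also reduces words to a common stem.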

Please read this blog post about creating and using a Lucene.net index.

I assume you are tagging blog posts. If I am totally wrong, please say so. In order to search for tags, you need to represent them as Lucene entities, namely as tokens inside a "tags" field.

One way of doing so is to assign one Lucene document per blog post. The document will have at least the following fields:

  • id: unique id of the blog post.
  • content: the text of the blog post.
  • tags: list of tags.

Indexing: Whenever you add a tag to a post, remove a tag or edit one, you will need to re-index the post. The Analyzer will transform the fields into their token representation.

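// i and text are the post's id and body; tags is assumed to hold the post's tags as one string.
Document doc = new Document();
// Stored but not indexed: the id can be read back from a hit, but not searched on.
doc.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
// Stored and tokenized by the Analyzer, so both of these fields are searchable.
doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("tags", tags, Field.Store.YES, Field.Index.TOKENIZED));
// writer is an already opened IndexWriter.
writer.AddDocument(doc);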

The remaining part is retrieval. For this, you need to create a QueryParser and pass it a query string, like this:

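// "tags" is the default field; analyzer should be the same Analyzer used when indexing.
QueryParser qp = new QueryParser("tags", analyzer);
Query q = qp.Parse(s);
// searcher is an IndexSearcher opened on the same index.
Hits hits = searcher.Search(q);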

The syntax you need for s will be:

tags:apples tags:carrots

to search for apples or carrots, and

tags:carrots NOT tags:apples

to search for carrots but not apples.

See the Lucene Query Parser Syntax for details on constructing s.
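As a rough illustration of what these two kinds of query are expected to return for the four items in the question, here is a small sketch in plain C#. It is not Lucene.net code: TagQuerySketch and Search are hypothetical names, and ranking purely by the number of matching tags is a simplifying assumption. Lucene's own scoring also takes other factors into account, such as how rare a term is, so treat this as an approximation of the ordering rather than Lucene's formula.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Conceptual model only: this mimics the query semantics, not Lucene's scoring.
static class TagQuerySketch
{
    // Returns the ids of items that carry at least one wanted tag and no
    // excluded tag, ordered by number of matching tags (best first), then by id.
    public static List<int> Search(
        Dictionary<int, string[]> items, string[] wanted, string[] excluded)
    {
        return items
            .Where(item => !item.Value.Intersect(excluded).Any())
            .Select(item => new
            {
                Id = item.Key,
                Matches = item.Value.Intersect(wanted).Count()
            })
            .Where(hit => hit.Matches > 0)
            .OrderByDescending(hit => hit.Matches)
            .ThenBy(hit => hit.Id)
            .Select(hit => hit.Id)
            .ToList();
    }

    static void Main()
    {
        // The four tagged items from the question.
        var items = new Dictionary<int, string[]>
        {
            { 1, new[] { "apples", "carrots" } },
            { 2, new[] { "apples" } },
            { 3, new[] { "carrots" } },
            { 4, new[] { "apple", "banana" } },
        };

        // Counterpart of: tags:apples tags:carrots
        var either = Search(items, new[] { "apples", "carrots" }, new string[0]);
        Console.WriteLine(string.Join(", ", either)); // prints "1, 2, 3"

        // Counterpart of: tags:carrots NOT tags:apples
        var carrotsOnly = Search(items, new[] { "carrots" }, new[] { "apples" });
        Console.WriteLine(string.Join(", ", carrotsOnly)); // prints "3"
    }
}
```

Item 1 comes first for the first query because it matches both tags, which is the ordering the question asks for. Item 4 is absent because its tag is "apple", not "apples", and this sketch compares tags exactly.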