How is an aggregator built?

Mircea · May 30, 2009 · Viewed 11.2k times

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that?

Do I need a spider/crawler that crawls the web to find the information I need (and how would I tell the crawler what to crawl, since I don't want to fetch the whole web)? Then an indexing system to index and organize the information I crawled, which would also serve as a search engine?

Are systems like Nutch (lucene.apache.org/nutch) OK for what I want? Do you recommend something else?

Or can you recommend another approach?

For example, how is Techmeme.com built? (It's an aggregator of technology news, and it's completely automated; only recently did they add some human intervention.) What would it take to build such a service?

Or how does Kayak.com aggregate its data? (It's a travel aggregator service.)

Answer

monksy · Oct 8, 2009

This all depends on the kind of aggregator you are looking for.

Types:

  • Loosely defined - Generally this requires your data source to be very flexible about determining the type of information gathered (it has to answer the question: is this site/information travel related? Humour? Business related?). A minimal classification sketch follows this list.
  • Specific - This relaxes the data-storage requirements, since all of the data is assumed to be specifically travel related, with fields for flights, hotel prices, etc.
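To illustrate the loosely defined case, here is a minimal sketch in Python of a keyword-based relevance check. The niche names, keyword lists, and threshold are hypothetical stand-ins; a real system would more likely use a trained text classifier than keyword counting.

```python
# Hypothetical keyword lists per niche; a real system would learn these.
NICHE_KEYWORDS = {
    "travel": {"flight", "hotel", "itinerary", "airfare", "destination"},
    "technology": {"software", "startup", "gadget", "hardware", "chip"},
    "humour": {"joke", "funny", "parody", "satire", "comedy"},
}

def classify(text, threshold=3):
    """Return the niche whose keywords occur most often, or None."""
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    best_niche, best_hits = None, 0
    for niche, keywords in NICHE_KEYWORDS.items():
        hits = sum(1 for w in words if w in keywords)
        if hits > best_hits:
            best_niche, best_hits = niche, hits
    # Only report a niche if the evidence clears the threshold.
    return best_niche if best_hits >= threshold else None

# Example: this page clears the travel threshold, so the grabber keeps it.
print(classify("Compare hotel and flight prices for your next destination, plus hotel deals"))
```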

Typically an aggregator is a system of sub-programs:

  1. Grabber - this searches for and grabs all of the content that needs to be summarized.
  2. Summarization - this is typically done through queries to the database and can be adjusted based on user preferences [through programming logic].
  3. View - this formats the information the way the user would like to see it, and can respond to feedback on the user's likes or dislikes of the suggested items. A sketch of this pipeline follows the list.
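Here is a minimal sketch of that grabber/summarization/view split in Python, assuming RSS feeds as the data source and SQLite as the store. The feed URL is a placeholder, and feedparser is just one possible fetching library, not the only way to build this.

```python
import sqlite3
import feedparser  # third-party: pip install feedparser

FEEDS = ["https://example.com/travel.rss"]  # hypothetical seed feeds

db = sqlite3.connect("aggregator.db")
db.execute("""CREATE TABLE IF NOT EXISTS items
              (url TEXT PRIMARY KEY, title TEXT, published TEXT)""")

# 1. Grabber: pull every entry from each feed into the store,
#    skipping items we have already seen (same URL).
for feed_url in FEEDS:
    for entry in feedparser.parse(feed_url).entries:
        db.execute("INSERT OR IGNORE INTO items VALUES (?, ?, ?)",
                   (entry.link, entry.title, entry.get("published", "")))
db.commit()

# 2. Summarization: a query over the store; the sort order and limit
#    could be adjusted per user preference.
recent = db.execute(
    "SELECT title, url FROM items ORDER BY published DESC LIMIT 10"
).fetchall()

# 3. View: format the selected items for the user.
for title, url in recent:
    print(f"{title}\n  {url}")
```

In a real deployment each stage would run as its own process (the grabber on a schedule, the view behind a web server), which is what makes the sub-program split useful: the stages only share the data store.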