Web scraping - how to identify main content on a webpage

kefeizhou · Jan 12, 2011 · Viewed 21.3k times

Given a news article webpage (from any major news source such as the Times or Bloomberg), I want to identify the main article content on that page and throw out the other miscellaneous elements such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining (preferably Python-based)?

Answer

gte525u · Jan 12, 2011

There are a number of ways to do it, but none will always work. Here are the two easiest:

  • If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for that site. (The mapping is site-specific and cannot really be generalized across sites.)
  • Use the arc90 Readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is that it looks for divs with p tags within them. It will not work for some websites, but it is generally pretty good.
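Since the question asks for Python, here is a rough sketch of the "divs with p tags within them" idea using only the standard library. It is not the arc90 Readability code, only the short version described above: it credits each paragraph's text to the innermost enclosing div and returns the paragraphs of the div with the most paragraph text. The names ParagraphDivScorer and extract_main_text are made up for this example, and the real algorithm applies further scoring that is not reproduced here.

```python
from html.parser import HTMLParser


class ParagraphDivScorer(HTMLParser):
    """Collect the <p> text found under each <div> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_stack = []    # indices of the currently open <div> elements
        self.paragraphs = []   # paragraphs[i] = list of <p> texts credited to div i
        self.current = None    # text chunks of the <p> being read, or None

    def _flush_paragraph(self):
        # Credit the finished paragraph to the innermost open <div>, if any.
        if self.current is not None and self.div_stack:
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs[self.div_stack[-1]].append(text)
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._flush_paragraph()
            self.div_stack.append(len(self.paragraphs))
            self.paragraphs.append([])
        elif tag == "p":
            self._flush_paragraph()  # tolerate <p> tags that were never closed
            self.current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush_paragraph()
        elif tag == "div":
            self._flush_paragraph()
            if self.div_stack:
                self.div_stack.pop()

    def handle_data(self, data):
        # Only text inside a <p> counts toward a div's score.
        if self.current is not None:
            self.current.append(data)


def extract_main_text(html):
    """Return the paragraphs of the <div> holding the most <p> text."""
    scorer = ParagraphDivScorer()
    scorer.feed(html)
    scorer.close()
    best = max(
        scorer.paragraphs,
        key=lambda ps: sum(len(p) for p in ps),
        default=[],
    )
    return "\n\n".join(best)
```

Calling extract_main_text(page_html) on a downloaded page returns the winning div's paragraphs separated by blank lines, or an empty string if no div contains a paragraph. Expect it to fail on sites that do not wrap article text in p tags inside a div, which matches the caveat above.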