How can I extract only the main textual content from an HTML page?

Renato Dinhani picture Renato Dinhani · Aug 11, 2011 · Viewed 19.5k times · Source

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with some short description to the entire texts (this is common in news portals) and I don't want to discard these shorts text.

So if an API does this, get the different textual parts/the blocks splitting each one in some manner that differ from a single text (all in only one text is not useful), please report.


The Question

I download some pages from random sites, and now I want to analyze the textual content of the page.

The problem is that a web page have a lot of content like menus, publicity, banners, etc.

I want to try to exclude all that is not related with the content of the page.

Taking this page as example, I don't want the menus above neither the links in the footer.

Important: All pages are HTML and are pages from various differents sites. I need suggestion of how to exclude these contents.

At moment, I think in excluding content inside "menu" and "banner" classes from the HTML and consecutive words that looks like a proper name (first capital letter).

The solutions can be based in the the text content(without HTML tags) or in the HTML content (with the HTML tags)

Edit: I want to do this inside my Java code, not an external application (if this can be possible).

I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering

Answer

Kurt Kaylor picture Kurt Kaylor · Aug 13, 2011

Take a look at Boilerpipe. It is designed to do exactly what your looking for, remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract HTML.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

There are also options to use a Reader, which opens up a large number of options.