I am looking for an open source framework or algorithm that extracts the article text from any HTML page by cleaning up the HTML and removing clutter, similar to what Pocket (formerly Read It Later) does.
Pocket official webpage: http://getpocket.com/
A similar question has already been asked: "How to extract text contents from html like Read it later or InstaPaper Iphone app?", but my requirement is a bit different. I want to clean the HTML and extract the main content together with its images, while preserving the fonts and styling (CSS).
I would recommend NReadability, together with HtmlAgilityPack.
After NReadability has transcoded the page, the main text is always in the div with id readInner.
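// Replace this with any URL.
string url = "http://www.bbc.co.uk/news/world-asia-19457334";
var transcoder = new NReadability.NReadabilityWebTranscoder();
bool mainContentExtracted;
string page = transcoder.Transcode(url, out mainContentExtracted);
if (mainContentExtracted)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);
    // SelectSingleNode returns null when nothing matches, so guard each lookup.
    var titleNode = doc.DocumentNode.SelectSingleNode("//title");
    var imgNode = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']");
    var mainNode = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
    var title = titleNode != null ? titleNode.InnerText : null;
    var imgUrl = imgNode != null ? imgNode.GetAttributeValue("content", "") : null;
    var mainText = mainNode != null ? mainNode.InnerText : null;
}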
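
Since the question asks for images and markup rather than plain text, note that InnerText discards both. Here is a minimal sketch of one way to keep them, assuming the transcoded page leaves the article's img tags inside the readInner div (I have not verified this for every page). The names ReadableContent and ExtractMainHtml are made up for illustration and are not part of NReadability or HtmlAgilityPack.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ReadableContent
{
    // Hypothetical helper: returns the markup inside div#readInner and
    // appends the src of every image found there to imageUrls.
    // Returns null when the div is not present.
    public static string ExtractMainHtml(string transcodedPage, List<string> imageUrls)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(transcodedPage);

        var main = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
        if (main == null)
            return null;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = main.SelectNodes(".//img[@src]");
        if (images != null)
        {
            foreach (var img in images)
                imageUrls.Add(img.GetAttributeValue("src", ""));
        }

        // InnerHtml keeps tags and attributes; InnerText would strip them.
        return main.InnerHtml;
    }
}
```

You would pass it the page string returned by Transcode. Inline style attributes survive only if the transcoder leaves them in place, and relative image URLs may still need to be resolved against the original page URL.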