I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see on that particular page, excluding ads, banners, comments, etc. I want to ensure that when a user saves a page, the text they wanted to read is saved, and nothing else.
In short, I need to build an application that works just like Readability ( http://www.readability.com ). I need to take the useful content of the web page and store it in a separate file. I don't really know how to go about it.
I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the data extraction needs to be done offline.
There are two methods that I could think of:
Use a machine-learning-based algorithm (like this: http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/ )
Develop a web scraper that could satisfactorily remove all clutter from web pages.
Is there an existing tool that does this? I came across the boilerpipe library ( http://code.google.com/p/boilerpipe/ ) but haven't used it. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly ones written in PHP or Python, that do this kind of web scraping?
If I need to build my own tool to do this, how would you suggest I go about it?
Since I'd need to clean up messy or incomplete HTML before I begin parsing it, I'd use a tool like Tidy ( http://www.w3.org/People/Raggett/tidy/ ) or Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) to do the job.
But I don't know how to extract content after this step.
PS. I am an amateur and would love it if there were ready-to-use open source tools that do this and can be easily integrated into the code I'll write in PHP or Python. Or, if I have to write my own code, I'd love to get guidance from anyone who's done such work before! :) Thanks a lot!
Did you type 'python readability' into Google? There is a pretty popular library (200+ followers) on GitHub:
https://github.com/buriy/python-readability
Additionally, searching for 'php readability' turns up a PHP port. It has 100 followers, but it has not had any activity for almost two years: https://github.com/feelinglucky/php-readability
Finally, the most popular (350+ GitHub followers) is the Ruby Readability port: https://github.com/iterationlabs/ruby-readability
At the very least, you can read the source of these three projects to see how each one goes about extracting the "important parts" of a webpage.
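To give a rough feel for the kind of heuristic such tools build on, here is a minimal offline sketch using only the Python standard library. It is not the algorithm of any of the projects above; it is a simplified assumption of mine: split the page into text blocks, then keep blocks that are reasonably long and contain little link text. The names `BlockCollector` and `extract_main_text`, the tag sets, and the thresholds `min_chars` and `max_link_density` are all hypothetical and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Tags whose text is never reader-visible content.
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section",
              "h1", "h2", "h3", "blockquote", "pre"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished (text, link_chars) pairs
        self._parts = []       # text fragments of the current block
        self._link_chars = 0   # characters inside <a> in the current block
        self._skip_depth = 0   # > 0 while inside script/style/noscript
        self._link_depth = 0   # > 0 while inside <a>

    def _flush(self):
        # Close the current block, normalising runs of whitespace.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_chars=80, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(html_string)` returns the kept blocks joined by blank lines, which you could then write to a file. Navigation bars tend to fail the link-density check and short captions or ad labels fail the length check, but a crude filter like this will misjudge many real pages, which is why reading the libraries above is worthwhile.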