What is the best way to parse a web page in Ruby?

Question 1

What is the best way to parse a web page in Ruby?

html xml ruby screen-scraping

Jeremy Mack · Sep 26, 2008 · Viewed 9.5k times · Source

Answer

Unfortunately, Stack Overflow claims to serve XML, but its markup is not actually well-formed. Hpricot, however, can parse this tag soup into a tree of elements for you.

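require 'hpricot'
require 'open-uri'

# Fetch the page and let Hpricot build an element tree from the tag soup.
# URI.open replaces the bare open(url) form, which Ruby 3.0 removed.
doc = Hpricot(URI.open("http://stackoverflow.com/users/19990/armin-ronacher"))
# Select the reputation element with a CSS selector, then keep only the digits.
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i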

And so forth.
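
The last step of the snippet above, turning the selected element's text into a number, can be tried on its own without Hpricot or a network connection. This is only a sketch: `digits_to_i` is a hypothetical helper name and the sample strings are made up, not taken from a real user page.

```ruby
# Hypothetical helper (not part of Hpricot): turn scraped text such as
# "12,345" into an Integer by dropping every non-digit character first.
def digits_to_i(text)
  text.gsub(/[^\d]+/, "").to_i
end

puts digits_to_i("12,345")                 # thousands separator is removed
puts digits_to_i("\n  987 reputation\n")   # surrounding whitespace and words are removed
puts digits_to_i("no digits")              # an empty string converts to 0
```

Note that this approach concatenates every digit it finds, so it is only safe when the selector matches a single number; text like "3 gold, 12 silver" would come out as 312.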

Question 2

I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?

Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.
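
The failure described here is what a strict XML parser is expected to do with unbalanced markup. The sketch below reproduces it with REXML, which is an assumption on my part, since the question does not say which library was used. The HTML fragment and the `well_formed?` helper are made up for illustration.

```ruby
require 'rexml/document'

# Made-up fragment standing in for a real page: the outer <div> is never closed.
TAG_SOUP = '<div class="summary"><span>42</span>'

# Returns true when the markup parses as well-formed XML, false otherwise.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(TAG_SOUP)             # unbalanced markup
puts well_formed?(TAG_SOUP + '</div>')  # same markup with the div closed
```

A strict parser rejects the whole document over one missing tag, which is why a forgiving HTML parser tends to be a better fit for real-world pages than an XML conversion.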

I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.

What is the best way to parse a web page in Ruby?
