Extract Links from Webpage using R

Question 1

Extract Links from Webpage using R

r web-scraping

Btibert3 · Sep 19, 2010 · Viewed 38.9k times · Source

Answer

Answer

Even easier with rvest:

library(xml2)
library(rvest)

URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"

pg <- read_html(URL)

head(html_attr(html_nodes(pg, "a"), "href"))

## [1] "//stackoverflow.com"                                                                                                                                          
## [2] "http://chat.stackoverflow.com"                                                                                                                                
## [3] "//stackoverflow.com"                                                                                                                                          
## [4] "http://meta.stackoverflow.com"                                                                                                                                
## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"                                                       
## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"

Question 2

The two posts below are great examples of different approaches of extracting data from websites and parsing it into R.

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage

I am very new to programming, and am just starting out with R, so I am hoping this question is pretty basic, but given those posts above, I imagine that it is.

All I am looking to do is extract links that match a given pattern. I feel like I could probably use RCurl to read in the web pages and extract them brute force method using string expressions. That said, if the webpage is fairly well formed, how would I go about doing so using the XML package.

As I learn more, I like to "look" at the data as I work through the problem. The issue is that some of these approaches generate lists of lists of lists, etc., so it is hard for someone that is new (like me) to walk through where I need to go.

Again, I am very new to all that is programming, so any help or code snippets will be greatly appreciated.

Extract Links from Webpage using R

Answer

Related questions