R: Using rvest package instead of XML package to get links from URL

Question 1

R: Using rvest package instead of XML package to get links from URL

xml r web-scraping rvest

capm · Dec 4, 2014 · Viewed 12.2k times · Source

Answer

Answer

Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")

Question 2

I use XML package to get the links from this url.

# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read links and and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

While this method is very efficient, I've used rvest and seems faster at parsing a web than XML. I tried html_nodes and html_attrs but I can't get it to work.

R: Using rvest package instead of XML package to get links from URL

Answer

Related questions