Web scraping pdf files from HTML

SChatcha · Oct 2, 2017 · Viewed 9.3k times · Source

How can I scrape PDF documents from an HTML page? I am using R, and so far I can only extract the text from the HTML. An example of the website I want to scrape is as follows.

https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx

Regards

Answer

KenHBS · Oct 2, 2017

When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page   <- getURL(url)
parsed <- htmlParse(page)
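# Select only anchors that have an href, so the result is a plain character vector
links  <- xpathSApply(parsed, path="//a[@href]", xmlGetAttr, "href")
# Escape the dot and anchor at the end, so only links ending in .pdf are kept
inds   <- grep("\\.pdf$", links, ignore.case=TRUE)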
links  <- links[inds]

links contains all the URLs to the PDF-files you are trying to download.
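To see how the filtering step behaves, here is a minimal sketch in base R. The href values are made up for illustration and are not taken from the site in the question. It shows that a pattern with an escaped dot and an end anchor keeps only links ending in .pdf (in either case), whereas an unanchored match on "pdf" would also pick up a path like /pdf-archive/index.html.

```r
# Hypothetical href values, invented for this example
hrefs <- c("/docs/report_2017.pdf",
           "/English/Pages/default.aspx",
           "/docs/Summary.PDF",
           "/pdf-archive/index.html")

# "\\." matches a literal dot, "$" anchors the match at the end of the string
is_pdf    <- grepl("\\.pdf$", hrefs, ignore.case = TRUE)
pdf_links <- hrefs[is_pdf]

print(pdf_links)
```

If the site uses links with query strings after the file name, the end anchor would have to be relaxed accordingly.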

Beware: many websites don't like it very much when you scrape their documents automatically, and you may get blocked.

With the links in place, you can loop through them, download the files one by one, and save each one in your working directory under the corresponding name in destination. I decided to extract reasonable document names for your PDFs from the links, taking the final piece after the last / in each URL:

regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
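As a quick check of what the file name extraction produces, here is a small sketch on two made-up URLs (example.org is a placeholder, not the real site). It also compares the result with base R's basename(), which should give the same names for simple URLs like these.

```r
# Hypothetical URLs, invented for this example
links_demo <- c("https://example.org/a/b/report_q1.pdf",
                "https://example.org/files/summary.pdf")

# "[^/]+$" matches the run of non-slash characters at the end of each string
m           <- regexpr("[^/]+$", links_demo, perl = TRUE)
names_regex <- regmatches(links_demo, m)

# basename() strips everything up to the last path separator
names_base  <- basename(links_demo)

print(names_regex)
print(identical(names_regex, names_base))
```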

To avoid overloading the servers of the website, I have heard it is friendly to pause your scraping every once in a while, so I use Sys.sleep() to pause for a random time between 1 and 5 seconds after each download:

for(i in seq_along(links)){
  download.file(links[i], destfile=destination[i], mode="wb")  # binary mode keeps PDFs intact on Windows
  Sys.sleep(runif(1, 1, 5))
}
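One thing the loop does not handle is a download that fails partway through: an error from download.file() would stop the whole loop. The following is a minimal sketch of one way to guard against that, not part of the original answer. The helper name download_all and its arguments fetch, min_wait and max_wait are hypothetical names introduced here; fetch defaults to download.file and can be swapped out, which makes the function easy to try without touching the network.

```r
# Hypothetical helper: try every link, remember which ones succeeded
download_all <- function(links, destination, fetch = utils::download.file,
                         min_wait = 1, max_wait = 5) {
  ok <- logical(length(links))
  for (i in seq_along(links)) {
    ok[i] <- tryCatch({
      # download.file() returns 0 on success; mode = "wb" writes binary files
      status <- fetch(links[i], destfile = destination[i], mode = "wb")
      status == 0
    }, error = function(e) {
      # Report the failure and carry on with the next link
      message("Failed: ", links[i], " (", conditionMessage(e), ")")
      FALSE
    })
    # Random pause between requests, as in the loop above
    Sys.sleep(runif(1, min_wait, max_wait))
  }
  ok
}

# Usage with the vectors built earlier (commented out, as it hits the network):
# ok <- download_all(links, destination)
# links[!ok]   # links that could not be downloaded
```

The returned logical vector lets you retry only the failed links later instead of starting over.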