Issues with RCurl crawler based on concurrent requests

marbel · Sep 29, 2014 · Viewed 7.7k times

The following is a script that reproduces the problems I'm facing when building a crawler with RCurl that performs concurrent requests. The objective is to download the content of several thousand websites in order to perform statistical analysis, so the solution should scale.

library(RCurl)
library(httr)

uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com", 
         "p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com", 
         "mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com", 
         "xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar", 
         "android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")

### RCurl Concurrent requests 

getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE){
  content = list()
  curls = list()
  # Register one curl handle per URI on the multi handle
  for(i in uris) {
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update,
                       timeout = 2, maxredirs = 3, verbose = TRUE,
                       followLocation = TRUE, ...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, curl)
  }
  if(.perform) {
    # Perform all requests concurrently and return the gathered bodies
    complete(multiHandle)
    lapply(content, function(x) x$value())
  } else {
    return(list(multiHandle = multiHandle, content = content))
  }
}

### Split uris into 3 groups
uris_ls = split(uris, 1:3)

### retrieve content 
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

library(plyr)
a = lapply(uris_content, function(x) ldply(x, rbind))
result = ldply(a, rbind)
names(result) <- c('url', 'content')
result$number_char <- nchar(as.character(result$content))

### Here are examples of url that aren't working
url_not_working = result[result$number_char == 0, 1]

# url_not_working
# [1] "inforapido.com.ar"              "canchallena.lanacion.com.ar"    "fbapp://256002347743983/thread"
# [4] "xnxx.com"                       "startappexchange.com"           "wv.inner-active.mobi"          
# [7] "livefyre.com"   

### Using httr::GET it works fine

get_httr = GET(url_not_working[2])
content(get_httr, 'text')

# Using a single getURL() call gives the same result (it still fails)
get_rcurl = getURL(url_not_working[2], encoding='UTF-8', timeout = 2, 
                   maxredirs = 3, verbose = TRUE,
                   followLocation = TRUE)
get_rcurl

Question:

Given the number of web pages I need to crawl, I would rather use RCurl for this as it supports concurrent requests. I wonder whether it is possible to improve the getURIs() call so that it works like the GET() version in the cases where the getURL()/getURIs() version fails.
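
One difference I suspect (not verified) is that httr::GET sends a default User-Agent header while RCurl does not, so some of these hosts may simply be rejecting bare requests. Since getURIs() already forwards extra curl options through its ... argument, this is the kind of call I would like to get working (the user-agent string below is just an illustrative value):

# Sketch: forward a user agent through the ... argument of getURIs()
ua <- "Mozilla/5.0 (compatible; R crawler)"
test_content <- getURIs(uris_ls[[1]], useragent = ua)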

UPDATE:

I've added a gist with more data (990 uris) to better reproduce the problem.

uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After running:

uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

I get the following error:

Error in curlMultiPerform(obj) : embedded nul in string: 'GIF89a\001'
In addition: Warning message:
In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale

Using getURIAsynchronous:

  uris_content <- list()
  for(i in seq_along(uris_ls)){
    uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]], 
         .opts=list(timeout = 2, maxredirs = 3, verbose = TRUE,
         followLocation = TRUE))
  }

I get a similar error: Error in nchar(str) : invalid multibyte string 1
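
The embedded-nul and invalid-multibyte errors suggest that some hosts answer with binary content (the 'GIF89a' prefix is the header of a GIF image), which the text gatherer cannot store as a string. As a diagnostic sketch (not a fix), a single URI can be fetched as raw bytes and inspected without triggering the error; inforapido.com.ar is used here only because it appears in the earlier failing list:

# Sketch: fetch one URI as raw bytes so an embedded nul cannot break the text gatherer
raw_body <- getBinaryURL("inforapido.com.ar",
                         .opts = list(timeout = 2, maxredirs = 3,
                                      followLocation = TRUE))
# Drop nul bytes before converting, just to be able to inspect the payload
rawToChar(raw_body[raw_body != as.raw(0)])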

UPDATE 2

library(RCurl)
uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After trying the following:

Sys.setlocale(locale="C")
uris_content <- list()
for(i in seq_along(uris_ls)){
    uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]], 
       .opts=list(timeout = 2, maxredirs = 3, verbose = TRUE,
       followLocation = TRUE))
}

The result is that it works well for the first 225 URLs, then it just returns zero content from the websites. Is this the embedded-nul issue?

# This is a quick way to inspect the output:
nc = lapply(uris_content, nchar)
nc[[5]]
 [1]  51422      0     16  19165 111763      6  14041    202   2485      0
[11]  78538      0      0      0 133253  42978      0      0   7880  33336
[21]   6762    194     93      0      0      0      0      0      9      0
[31] 165974  13222  22605   1392      0  42932   1421      0      0      0
[41]      0  13760    289      0   2674

nc[[6]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0
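
To map the empty results back to their URIs (a quick sketch using the nc list computed above):

# Sketch: URIs in chunk 6 that returned zero-length content
uris_ls[[6]][nc[[6]] == 0]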

Answer

marbel · Oct 9, 2014

As nobody answered, I propose a temporary solution. If getURIAsynchronous doesn't work, download sequentially using httr::GET and httr::content, which don't have the embedded-nul issue.

library(RCurl)
library(httr)

Sys.setlocale(locale="C")

opts = list(timeout = 2, maxredirs = 3, 
            verbose = TRUE, followLocation = TRUE)

try_asynch <- function(uris, .opts = opts){
  getURIAsynchronous(uris, .opts = .opts)
}

get_content <- function(uris){
  # Try the concurrent download first
  cont <- try_asynch(uris)
  nc <- lapply(cont, nchar)
  nc <- sapply(nc, function(x) ifelse(sum(x > 0), 1, 0))  # 1 if the URI returned any content
  # If almost everything came back empty, fall back to sequential httr::GET
  if(sum(nc) < 10){
    r <- lapply(uris, function(x) GET(x))
    cont <- lapply(r, function(x) content(x, 'text'))
  }
  cont
}

docs <- lapply(uris_ls, get_content)
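
As a quick sanity check (a sketch, mirroring the nchar inspection from the question), the share of non-empty documents per chunk can be computed from docs:

# Sketch: fraction of non-empty documents in each chunk
sapply(docs, function(d) mean(nchar(unlist(d)) > 0))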