pycurl/curl not following the CURLOPT_TIMEOUT option

Incognito · Dec 28, 2010

I have a multi-threaded script that occasionally freezes when it connects to a server but the server doesn't send anything back. netstat shows a connected TCP socket. This happens even with TIMEOUT set. The timeout works fine in an unthreaded script. Here's some sample code:

import threading
import StringIO
import pycurl

def xmlscraper(url):
    htmlpage = StringIO.StringIO()
    rheader = StringIO.StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.USERAGENT, "user agent string")
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.WRITEFUNCTION, htmlpage.write)
    c.setopt(pycurl.HEADERFUNCTION, rheader.write)
    c.setopt(pycurl.HTTPHEADER, ['Expect:'])
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()  # this is the call that occasionally hangs
    c.close()

pycurl.global_init(pycurl.GLOBAL_ALL)
for url in urllist:  # urllist: the list of URLs to fetch (about 10 in my case)
    t = threading.Thread(target=xmlscraper, args=(url,))
    t.start()

Any help would be greatly appreciated! I've been trying to solve this for a few weeks now.

edit: The urllist has about 10 URLs. It doesn't seem to matter how many there are.

edit2: I just tested the code below, using a PHP script that sleeps for 100 seconds.

import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 6)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://xxx.xxx.xxx.xxx/test.php')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

t = threading.Thread(target=testf)
t.start()
t.join()

Pycurl in that code seems to time out properly. So I guess it has something to do with the number of URLs? The GIL?
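For reference, the PHP script just sleeps before responding; a rough Python stand-in that does the same thing would be something like this (the handler name and port are arbitrary, just a sketch for reproducing the test):

import time
import BaseHTTPServer

class SlowHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    # Hypothetical stand-in for test.php: accept the connection, then stall
    # well past the client's 6-second TIMEOUT before sending any response.
    def do_GET(self):
        time.sleep(100)
        self.send_response(200)
        self.end_headers()
        self.wfile.write('done')

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('', 8000), SlowHandler).serve_forever()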

edit3:

I think it might have to do with libcurl itself, because sometimes when I check the script, libcurl is still connected to a server for hours on end. If pycurl were timing out properly, the socket would have been closed.
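As an aside, one extra guard that's sometimes used against connections that go silent like this (not part of my script above, just a sketch) is libcurl's low-speed abort, which kills a transfer whose average throughput stays below a threshold:

import pycurl

def add_stall_guard(c):
    # Sketch: on top of TIMEOUT, abort any transfer that averages under
    # 1 byte/sec for 60 seconds (values chosen arbitrarily for illustration).
    c.setopt(pycurl.LOW_SPEED_LIMIT, 1)   # bytes per second
    c.setopt(pycurl.LOW_SPEED_TIME, 60)   # seconds below that rate before aborting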

Answer

Corey Goldberg · Dec 28, 2010

I modified your 'edit2' code to spawn multiple threads, and it works fine on my machine (Ubuntu 10.10 with Python 2.6.6):

import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost/cgi-bin/foo.py')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

for i in range(100):
    t = threading.Thread(target=testf)
    t.start()

I can spawn 100 threads and they all time out at 3 seconds (as specified).
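One way to double-check that the timeouts fire on schedule is to time each perform() and catch the timeout error; here's a sketch based on the code above (the timing and error handling are additions for illustration):

import threading
import time
import pycurl

def timed_testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost/cgi-bin/foo.py')
    c.setopt(pycurl.HTTPGET, 1)
    start = time.time()
    try:
        c.perform()
    except pycurl.error, err:
        # error code 28 is CURLE_OPERATION_TIMEDOUT
        print 'timed out after %.1fs: %s' % (time.time() - start, err)
    finally:
        c.close()

threads = [threading.Thread(target=timed_testf) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()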

I wouldn't go blaming the GIL and thread contention yet :)