I have a multi-threaded script that occasionally freezes when it connects to a server and the server sends nothing back. netstat shows a connected TCP socket. This happens even though I have TIMEOUT set; the timeout works fine in an unthreaded script. Here's some sample code:
import threading
import pycurl
import StringIO

def xmlscraper(url):
    htmlpage = StringIO.StringIO()
    rheader = StringIO.StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.USERAGENT, "user agent string")
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.WRITEFUNCTION, htmlpage.write)
    c.setopt(pycurl.HEADERFUNCTION, rheader.write)
    c.setopt(pycurl.HTTPHEADER, ['Expect:'])
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

pycurl.global_init(pycurl.GLOBAL_ALL)

# urllist is the list of URLs to fetch (about 10 of them in my case)
for url in urllist:
    t = threading.Thread(target=xmlscraper, args=(url,))
    t.start()
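For what it's worth, when I want to see which worker is actually stuck, I join the threads with a deadline, roughly like this (just a diagnostic sketch, not part of the real scraper; the 200-second deadline is an arbitrary value a bit past CONNECTTIMEOUT + TIMEOUT):

import threading

# Diagnostic sketch only: reuses xmlscraper() and urllist from above.
# Start all workers, then join each one with a deadline slightly past
# CONNECTTIMEOUT + TIMEOUT and report any thread that is still alive,
# i.e. still stuck inside perform().
threads = []
for url in urllist:
    t = threading.Thread(target=xmlscraper, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join(200)
    if t.isAlive():
        print "still hanging:", t.name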
Any help would be greatly appreciated! I've been trying to solve this for a few weeks now.
edit: The urllist has about 10 URLs. It doesn't seem to matter how many there are.
edit2: I just tested the code below, using a PHP script that sleeps for 100 seconds.
import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 6)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://xxx.xxx.xxx.xxx/test.php')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

t = threading.Thread(target=testf)
t.start()
t.join()
pycurl in that code seems to time out properly. So I guess it has something to do with the number of URLs? The GIL?
edit3:
I think it might have to do with libcurl itself, because sometimes when I check the script, libcurl is still connected to a server for hours on end. If pycurl were timing out properly, the socket would have been closed.
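One extra guard I'm thinking about trying (I haven't tested it in the threaded script yet, so treat this as a sketch) is libcurl's low-speed abort, which should drop a connection that stays open but sends essentially nothing:

import pycurl

# Sketch: in practice these two extra setopt calls would go next to the
# other options inside xmlscraper() above.
c = pycurl.Curl()
c.setopt(pycurl.NOSIGNAL, 1)
# Abort the transfer if it averages below 1 byte/sec for 60 consecutive
# seconds, i.e. the server is connected but not actually sending data.
c.setopt(pycurl.LOW_SPEED_LIMIT, 1)
c.setopt(pycurl.LOW_SPEED_TIME, 60)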
I modified your 'edit2' code to spawn multiple threads, and it works fine on my machine (Ubuntu 10.10 with Python 2.6.6):
import threading
import pycurl

def testf():
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost/cgi-bin/foo.py')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

for i in range(100):
    t = threading.Thread(target=testf)
    t.start()
I can spawn 100 threads and they all time out after 3 seconds, as specified.
I wouldn't go blaming the GIL and thread contention just yet :)
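If you want to confirm the timing on your side, this is roughly the harness I used; the timing and reporting parts are just for illustration, and testf() is the same function as above:

import threading
import time

# Start all 100 threads, join them, and report how long the whole batch
# took. With TIMEOUT set to 3, the wall time should stay close to 3
# seconds rather than hanging indefinitely.
start = time.time()
threads = [threading.Thread(target=testf) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "all threads finished in %.1f seconds" % (time.time() - start)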