I am trying to crawl user ratings of IMDb cinema movies from the review page (my database holds about 600,000 movies). I use jsoup to parse the pages as shown below (sorry, I have not posted the whole code because it is too long):
try {
    // connecting to mysql db
    ResultSet res = st.executeQuery(
            "SELECT id, title, production_year " +
            "FROM title " +
            "WHERE kind_id =1 " +
            "LIMIT 0 , 100000");
    while (res.next()){
        .......
        .......
        String baseUrl = "http://www.imdb.com/search/title?release_date=" +
                ""+year+","+year+"&title="+movieName+"" +
                "&title_type=feature,short,documentary,unknown";
        Document doc = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(0).get();
        .....
        .....
        // insert ratings into database
        ...
I tested it on the first 100, then the first 500, and then the first 2,000 movies in my database, and it worked well. But when I ran it on 100,000 movies I got this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)
I searched a lot for this error and found that it is a server-side error (a 5xx status code).
I then decided to add a condition so that when a connection fails, the code retries 2 more times and, if it still cannot connect, skips to the next URL instead of stopping. Since I am new to Java, I searched for similar questions and read these answers on Stack Overflow:
Exceptions while I am extracting data from a Web site
Jsoup error handling when couldn't connect to website
Handling connection errors and JSoup
But when I try to use "Connection.Response" as they suggest, the compiler tells me that "Connection.Response cannot be resolved to a type".
I would appreciate it if someone could help me. I am a newbie, and I know this might be simple, but I don't know how to fix it.
Well, I was able to get past the HTTP status 500 error by adding "ignoreHttpErrors(true)" as below:
org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
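org.jsoup.Connection.Response resp = con.execute();
Document doc = null;
if (resp.statusCode() == 200) {
// parse the response already fetched; con.get() would send a second request
doc = resp.parse();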
......
I hope this helps anyone who runs into the same error.
However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error:
"READ TIMED OUT".
I would appreciate any suggestions for fixing this error.
Upgrading my comments to an answer:
Connection.Response is org.jsoup.Connection.Response, so the "cannot be resolved to a type" error goes away once you import org.jsoup.Connection (or write the fully qualified name).
To create a Document instance only when there is a valid HTTP code (200), break your call into 3 parts: Connection, Response, Document.
Hence, your part of the code above gets modified to:
while (res.next()){
    .......
    .......
    String baseUrl = "http://www.imdb.com/search/title?release_date=" + ""
            + year + "," + year + "&title=" + movieName + ""
            + "&title_type=feature,short,documentary,unknown";
    Connection con = Jsoup.connect(baseUrl)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000);
    Connection.Response resp = con.execute();
    Document doc = null;
    if (resp.statusCode() == 200) {
        // parse the response already fetched; con.get() would send a second request
        doc = resp.parse();
        ....
    }
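For the "retry 2 more times, then move on to the next URL" behaviour asked about in the question, and for the later "Read timed out" failures, a small retry wrapper might help. The following is only a minimal sketch using the standard library: the class name CrawlHelpers, the interface IoCall, and the methods encodeParam and fetchWithRetry are hypothetical names introduced for illustration, not part of jsoup. It relies on the fact that jsoup's HttpStatusException and java.net.SocketTimeoutException are both IOExceptions. The encodeParam helper is included because the failing URL in the stack trace contains raw apostrophes in the title; whether that contributes to the 500 response is an assumption, not something confirmed here.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.SocketTimeoutException;
import java.net.URLEncoder;

class CrawlHelpers {

    /** A fetch operation that may fail with an IOException (hypothetical helper type). */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /** Percent-encodes a query-string value such as a movie title. */
    static String encodeParam(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    /**
     * Runs the fetch up to maxAttempts times, pausing between attempts.
     * Returns null when every attempt fails, so the caller can skip the URL.
     */
    static <T> T fetchWithRetry(IoCall<T> fetch, int maxAttempts, long pauseMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // back off before trying again
                }
            }
        }
        return null; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        // Encode the title that appears in the failing URL from the question.
        System.out.println(encodeParam("'Columbia' Close to the Wind"));

        // Simulate a fetch that times out twice and then succeeds.
        final int[] calls = {0};
        String result = fetchWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "page body";
        }, 3, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Inside the while loop, the lambda would wrap the jsoup call, for example fetchWithRetry(() -> Jsoup.connect(baseUrl).userAgent("Mozilla").timeout(10000).execute(), 3, 2000L), followed by a continue when the result is null. The 3 attempts and the 2-second pause are arbitrary illustrative values; a long crawl may need a longer pause between requests.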