java.io.FileNotFoundException for valid URL

Alexei · May 8, 2010

I use the ROME library (rome.dev.java.net) to fetch RSS feeds.

The code is:

URL feedUrl = new URL("http://planet.rubyonrails.ru/xml/rss");
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedUrl));

You can check that http://planet.rubyonrails.ru/xml/rss is a valid URL and that the page loads in a browser.

But my application throws this exception:

java.io.FileNotFoundException: http://planet.rubyonrails.ru/xml/rss
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1311)
        at com.sun.syndication.io.XmlReader.<init>(XmlReader.java:237)
        at com.sun.syndication.io.XmlReader.<init>(XmlReader.java:213)
        at rssdaemonapp.ValidatorThread.run(ValidatorThread.java:32)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

I don't use any proxy. I get this exception both on my PC and on the production server, and only for this URL; other URLs work.

Answer

Stephen C · May 8, 2010

The code that is throwing that exception looks like this ... assuming I've got the right version:

if (respCode >= 400) {
    if (respCode == 404 || respCode == 410) {
        throw new FileNotFoundException(url.toString());
    } else {
        throw new java.io.IOException(
            "Server returned HTTP"
            + " response code: " + respCode
            + " for URL: " + url.toString());
    }
}

In other words, when you do the GET from Java, you are getting a 404 or 410 response. When I make the same request with the wget utility, I get a 200 response. So my guess is that the problem is one of the following:

  • You happened to make the request when they were suffering from some configuration problem.
  • They have implemented their server to return 404 / 410 for certain User-Agent strings.
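
One way to tell these two cases apart is to ask the server for its status code directly, once with Java's default User-Agent and once with a different one. The following is only a minimal sketch using the standard HttpURLConnection API; the class name StatusProbe, the helper statusFor, and the sample User-Agent string are made up for illustration and are not part of ROME.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class StatusProbe {

    // Returns the HTTP status code for a GET on the given URL.
    // If userAgent is null, Java's default "Java/<version>" header is sent.
    static int statusFor(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try {
            // getResponseCode() reports 404/410 as a number instead of
            // throwing FileNotFoundException the way getInputStream() does.
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://planet.rubyonrails.ru/xml/rss";
        System.out.println("default User-Agent: " + statusFor(url, null));
        // Hypothetical User-Agent value, chosen only for the comparison.
        System.out.println("custom User-Agent:  " + statusFor(url, "Mozilla/5.0 (compatible; FeedProbe/1.0)"));
    }
}
```

If both requests return the same error code, the User-Agent is probably not the cause. If only the default one fails, the second explanation becomes much more likely.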

Other possibilities are that they are doing some kind of server-side filtering on IP addresses or that there is some DNS problem that is causing your requests to go to a different IP address. But both of these seem to be contradicted by the fact that you can access the feed in your browser.

If it is the User-Agent, take a look at their terms of service to see if they have banned certain kinds of use of their site / RSS feed.
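
If their terms do allow automated fetching, one possible workaround is to open the connection yourself with an explicit User-Agent and hand the resulting stream to the feed parser. This is a rough sketch, not a tested fix for this particular site: FeedFetcher, openFeed and the USER_AGENT value are hypothetical names, and the ROME hand-off in the comment assumes your ROME version has an XmlReader constructor that accepts an InputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class FeedFetcher {

    // Hypothetical identifier; use something that honestly describes your application.
    static final String USER_AGENT = "rssdaemonapp/1.0 (+http://example.com/contact)";

    // Opens the feed with an explicit User-Agent and returns the response body stream.
    // Any status other than 200 is reported with its code, so a 404 and a 410
    // are no longer hidden behind the same FileNotFoundException.
    static InputStream openFeed(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        int code = conn.getResponseCode();
        if (code != HttpURLConnection.HTTP_OK) {
            conn.disconnect();
            throw new IOException("HTTP " + code + " for " + url + " (User-Agent: " + userAgent + ")");
        }
        return conn.getInputStream();
    }

    // Assumed hand-off to ROME (not compiled here, since it needs the ROME jar):
    //   SyndFeed feed = new SyndFeedInput().build(
    //       new XmlReader(FeedFetcher.openFeed(feedUrl.toString(), FeedFetcher.USER_AGENT)));
}
```

The caller is responsible for closing the returned stream once the feed has been parsed.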