KXmlParser throws "Unexpected token" exception at the start of RSS parsing

Bostone · Mar 6, 2013 · Viewed 9.4k times

I'm trying to parse an RSS feed from Monster on Android v.17 using this URL:

http://rss.jobsearch.monster.com/rssquery.ashx?q=java

To get the content, I'm using HttpURLConnection as follows:

this.conn = (HttpURLConnection) url.openConnection();
this.conn.setConnectTimeout(5000);
this.conn.setReadTimeout(10000);
this.conn.setUseCaches(true);
conn.addRequestProperty("Content-Type", "text/xml; charset=utf-8");
is = new InputStreamReader(url.openStream());

What comes back is, as far as I can tell (and I verified it), legitimate RSS. The response headers are:

Cache-Control:private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:5958
Content-Type:text/xml
Date:Wed, 06 Mar 2013 17:15:20 GMT
P3P:CP=CAO DSP COR CURa ADMa DEVa IVAo IVDo CONo HISa TELo PSAo PSDo DELa PUBi BUS LEG PHY ONL UNI PUR COM NAV INT DEM CNT STA HEA PRE GOV OTC
Server:Microsoft-IIS/7.5
Vary:Accept-Encoding
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET

It starts like this (click the URL above if you want to see complete XML):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Monster Job Search Results java</title>
    <description>RSS Feed for Monster Job Search</description>
    <link>http://rss.jobsearch.monster.com/rssquery.ashx?q=java</link>

But when I attempt to parse it:

final XmlPullParser xpp = getPullParser();
xpp.setInput(is);
for (int type = xpp.getEventType(); type != XmlPullParser.END_DOCUMENT; type = xpp.next()) { /* parsing goes here */ }

The code immediately chokes on type = xpp.next() with the following exception:

03-06 09:27:27.796: E/AbsXmlResultParser(13363): org.xmlpull.v1.XmlPullParserException: 
   Unexpected token (position:TEXT @1:2 in java.io.InputStreamReader@414b4538) 

This means the parser cannot process the second character of line 1, <?xml version="1.0" encoding="utf-8"?>.

Here are the offending lines in KXmlParser.java (425-426); type == TEXT evaluates to true:

if (depth == 0 && (type == ENTITY_REF || type == TEXT || type == CDSECT)) {
    throw new XmlPullParserException("Unexpected token", this, null);
}

Any help? I tried setting the parser feature XmlPullParser.FEATURE_PROCESS_DOCDECL to false, but that didn't help.

I researched this on the web and here, and couldn't find anything that helps.

Answer

Vladimir Mironov · Mar 10, 2013

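The reason you are getting the error is that the XML file doesn't actually start with <?xml version="1.0" encoding="utf-8"?>. It starts with three special bytes, EF BB BF, which are the UTF-8 byte order mark (BOM). The parser sees the decoded BOM as text before the root element, hence the TEXT token at depth 0.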
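
If you want to check this yourself, here is a minimal sketch. The class name BomCheck and its two methods are hypothetical names made up for illustration, and the sample bytes stand in for the start of the real feed. It tests whether the data begins with EF BB BF and shows what an InputStreamReader hands back as the first character. On a standard JVM the UTF-8 decoder should return the BOM as the character U+FEFF rather than dropping it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomCheck {
    /** Returns true if the data begins with the UTF-8 byte order mark EF BB BF. */
    public static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Decodes the data as UTF-8 through an InputStreamReader and returns the first char, or -1 if empty. */
    public static int firstDecodedChar(byte[] data) {
        try (InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample data (an assumption): a BOM followed by the XML declaration.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        System.out.println("Starts with BOM: " + startsWithUtf8Bom(withBom));
        System.out.printf("First decoded char: U+%04X%n", firstDecodedChar(withBom));
    }
}
```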


InputStreamReader doesn't handle these bytes automatically, so you have to handle them manually. The simplest way to do it is to use BOMInputStream, available in the Apache Commons IO library:

this.conn = (HttpURLConnection) url.openConnection();
this.conn.setConnectTimeout(5000);
this.conn.setReadTimeout(10000);
this.conn.setUseCaches(true);
conn.addRequestProperty("Content-Type", "text/xml; charset=utf-8");
is = new InputStreamReader(new BOMInputStream(conn.getInputStream(), false, ByteOrderMark.UTF_8));  

I've checked the code above and it works well for me.
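
If you would rather not add Commons IO, handling the bytes manually could look roughly like the sketch below. It is an illustration only, not the approach tested above: the class BomSkipper and both method names are hypothetical, and it deals with the UTF-8 BOM only. It wraps the stream in a PushbackInputStream, reads up to three bytes, and pushes them back unless they are exactly EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkipper {
    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Demo helper: decodes a byte array as UTF-8 after dropping a leading BOM. */
    public static String decodeUtf8SkippingBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int r;
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the connection code above, that would be used as new InputStreamReader(BomSkipper.skipUtf8Bom(conn.getInputStream()), "UTF-8"). Passing the charset explicitly avoids relying on the platform default.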