What's the best approach for parsing XML/'screen scraping' in iOS? UIWebView or NSXMLParser?

Benedict Cohen · Aug 22, 2010 · Viewed 10.7k times

I am creating an iOS app that needs to get some data from a web page. My first thought was to use NSXMLParser's initWithContentsOfURL: and parse the HTML with the NSXMLParser delegate. However, this approach seems like it could quickly become painful (if, for example, the HTML changed, I would have to rewrite the parsing code, which could be awkward).

Seeing as I'm loading a web page, I took a look at UIWebView too. It looks like UIWebView may be the way to go. stringByEvaluatingJavaScriptFromString: seems like a very handy way to extract the data, and it would allow the JavaScript to be stored in a separate file that would be easy to edit if the HTML changed. However, using UIWebView seems a bit hacky (seeing as UIWebView is a UIView subclass it may block the main thread, and the docs say that the JavaScript has a limit of 10MB).

Does anyone have any advice regarding parsing XML/HTML before I get stuck in?

UPDATE:

I wrote a blog post about my solution: HTML parsing/screen scraping in iOS

Answer

cmar · Apr 21, 2011

I've done this a few times. The best approach I've found is to use libxml2, which has a mode for HTML. Then you can use XPath to query the document.

Working with the libxml2 API is not the most enjoyable. So, I usually bring over the XPathQuery.h/.m files documented on this page:

http://cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html

Then I fetch the data using an NSURLConnection and query the data with something like this:

NSArray *tdNodes = PerformHTMLXPathQuery(self.receivedData, @"//td[@class='col-name']/a/span");

Summary:

  1. Add libxml2 to your project. Here are some quick instructions for Xcode 4: http://cmar.me/2011/04/20/adding-libxml2-to-an-xcode-4-project/

  2. Get the XPathQuery.h/.m

  3. Use an XPath statement to query the HTML document.