Android Web Scraping with a Headless Browser

Pierre · Jul 1, 2013 · Viewed 15k times

I have spent a day researching a library that can do the following:

  • Retrieve the full contents of a webpage in the background, without rendering the result to a view.
  • Support pages that fire off AJAX requests to load additional data after the initial HTML has loaded.
  • Let me grab elements from the resulting HTML by XPath or CSS selector.
  • In future, possibly navigate to a next page (firing events, submitting buttons/links, etc.).

Here is what I have tried without success:

  • Jsoup: Works great, but has no support for JavaScript/AJAX (so it does not load the full page).
  • Android's built-in HttpEntity: same problem with JavaScript/AJAX as Jsoup.
  • HtmlUnit: Looks like exactly what I need, but after hours I cannot get it to work on Android. (Other users failed when trying to load the 12MB+ worth of jar files. I loaded the full source code and referenced it as a project library, only to find that things such as Applets and java.awt, which HtmlUnit uses, do not exist in Android.)
  • Rhino: I find this very confusing and don't know how to get it working on Android, or even whether it is what I am looking for.
  • Selenium Driver: Looks like it could work, but there is no straightforward way to run it headless, so that the actual HTML is not displayed in a view.

I really want HtmlUnit to work, as it seems best suited to my needs. Is there any way to get it working, or at least another library I have missed that would be suitable?

I am currently using Android Studio 0.1.7 and can move to Eclipse if needed.

Thanks in advance!

Answer

Pierre · Jul 17, 2013

OK, after two weeks I admit defeat and am using a workaround, which works great for me at the moment.

The problem:
Porting HtmlUnit to Android is too difficult (at least at my level of expertise). I am sure it's a worthwhile project (and not that time-consuming for an experienced Java programmer). I emailed the guys at HtmlUnit; they replied that they are not looking into a port, or into what effort one would involve, but suggested that anyone who wants to start such a project should send a message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).

The workaround:
I used Android's built-in WebView and overrode the onPageFinished method of its WebViewClient to inject JavaScript that grabs all the HTML once the page has fully loaded. WebView can also be used to run further JavaScript actions: clicking buttons, filling in forms, etc.
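
To give a rough idea of what those further JavaScript actions could look like, here is a minimal sketch in plain Java that only builds the javascript: URL strings to hand to webView.loadUrl(...). The class name JsActions and its methods are hypothetical helpers made up for this illustration, not part of Android, and the selectors you pass in depend entirely on the page being scraped.

```java
// Hypothetical helper (not an Android API): builds "javascript:" URLs
// that can be passed to WebView.loadUrl() to drive the loaded page.
final class JsActions {

    private JsActions() {}

    // Quote a value as a single-quoted JavaScript string literal,
    // escaping backslashes, quotes and line breaks.
    public static String jsString(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    // Clicks the first element matching the CSS selector, if there is one.
    public static String click(String cssSelector) {
        return "javascript:(function(){var e=document.querySelector("
                + jsString(cssSelector) + ");if(e){e.click();}})()";
    }

    // Sets the value of the first form field matching the CSS selector, if there is one.
    public static String fill(String cssSelector, String value) {
        return "javascript:(function(){var e=document.querySelector("
                + jsString(cssSelector) + ");if(e){e.value=" + jsString(value) + ";}})()";
    }
}
```

With a made-up selector, usage might look like webView.loadUrl(JsActions.click("#next")). On API level 19 and later, WebView.evaluateJavascript is probably the better entry point for running the same script.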

Code:

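import android.content.Context;
import android.webkit.JavascriptInterface;
import android.webkit.WebView;
import android.webkit.WebViewClient;

webView.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface(context);
webView.addJavascriptInterface(jInterface, "HtmlViewer");

webView.setWebViewClient(new WebViewClient() {

    @Override
    public void onPageFinished(WebView view, String url) {
        // The page has finished loading: pass its current HTML to showHTML()
        view.loadUrl("javascript:window.HtmlViewer.showHTML"
                + "('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
    }
});

// Loading is asynchronous; the HTML is delivered to showHTML() later
webView.loadUrl(StartURL);

public class MyJavaScriptInterface {

    private Context ctx;
    public String html;

    MyJavaScriptInterface(Context ctx) {
        this.ctx = ctx;
    }

    // Called from the injected JavaScript, on a WebView background thread
    @JavascriptInterface
    public void showHTML(String _html) {
        html = _html;
        // Parse here, once the HTML has actually arrived
        // (assumes ParseHtml is reachable, e.g. this is an inner class)
        ParseHtml(html);
    }
}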
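
One thing to keep in mind with the code above: the HTML only arrives after the page has loaded and the injected script has called showHTML, so anything that needs it has to wait for that callback. If you would rather wait for the result from a worker thread, a small holder along these lines might help. This is only a sketch in plain Java; HtmlResult is a hypothetical name, not an Android class, and await must not be called on the UI thread, because WebView delivers onPageFinished there.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical holder: hands one HTML string from the JavaScript
// interface callback to a thread that is waiting for it.
final class HtmlResult {

    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile String html;

    // Call this from the callback (e.g. showHTML) once the HTML arrives.
    void set(String value) {
        html = value;
        ready.countDown();
    }

    // Blocks until set() has been called or the timeout elapses.
    // Returns null on timeout or if the waiting thread is interrupted.
    String await(long timeout, TimeUnit unit) {
        try {
            return ready.await(timeout, unit) ? html : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
            return null;
        }
    }
}
```

In this arrangement showHTML would call result.set(_html), and a background thread would call result.await(30, TimeUnit.SECONDS) before parsing; the 30-second timeout is an arbitrary value chosen for illustration.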