Extremely simple code not working in HtmlUnit

Mosty Mostacho · Aug 26, 2011

I'm working with HtmlUnit 2.9 (the stable version that was released this month). Do you have any idea why the following code never returns from getPage()?

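import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {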

    public static void main(String[] args) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRedirectEnabled(false);
        webClient.setAppletEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setPopupBlockerEnabled(true);
        webClient.setTimeout(60000);
        webClient.setPrintContentOnFailingStatusCode(false);

        System.out.println("This is printed on screen");
        try {
            webClient.getPage("http://www.2cash.info/index.php");
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("This is NEVER printed on screen");
    }
}

I'm also adding the result of jstack. Notice I've marked a section that gets repeated constantly:

2011-08-26 03:15:45
Full thread dump Java HotSpot(TM) Server VM (20.1-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x09520400 nid=0x5363 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"JS executor for com.gargoylesoftware.htmlunit.WebClient@a7c45e" daemon prio=10 tid=0x6feb7400 nid=0x5356 waiting on condition [0x6fcfe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutor.run(JavaScriptExecutor.java:166)
    at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x70204c00 nid=0x5352 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x70202800 nid=0x5351 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x70200800 nid=0x5350 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x09514c00 nid=0x534f runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x09503400 nid=0x534e in Object.wait() [0x70798000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x09501c00 nid=0x534d in Object.wait() [0x707e9000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x7675cc58> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x7675cc58> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x09482400 nid=0x5349 runnable [0xb6c34000]
   java.lang.Thread.State: RUNNABLE
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getSlot(ScriptableObject.java:2603)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.defineProperty(ScriptableObject.java:1699)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureConstantsPropertiesAndFunctions(JavaScriptEngine.java:350)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureClass(JavaScriptEngine.java:330)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.init(JavaScriptEngine.java:199)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.access$000(JavaScriptEngine.java:79)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$1.run(JavaScriptEngine.java:146)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.initialize(JavaScriptEngine.java:157)
    at com.gargoylesoftware.htmlunit.WebClient.initialize(WebClient.java:1141)
    at com.gargoylesoftware.htmlunit.WebWindowImpl.setEnclosedPage(WebWindowImpl.java:109)
    at com.gargoylesoftware.htmlunit.html.FrameWindow.setEnclosedPage(FrameWindow.java:102)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:200)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.html.BaseFrame.<init>(BaseFrame.java:73)
    at com.gargoylesoftware.htmlunit.html.HtmlInlineFrame.<init>(HtmlInlineFrame.java:46)
    at com.gargoylesoftware.htmlunit.html.DefaultElementFactory.createElementNS(DefaultElementFactory.java:288)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.startElement(HTMLParser.java:506)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1136)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:742)
    at org.cyberneko.html.filters.DefaultFilter.startElement(DefaultFilter.java:136)
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2652)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2022)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)

    <THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPageIfPossible(BaseFrame.java:149)
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPage(BaseFrame.java:99)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1760)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:194)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    </THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>

    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at main.Main.<init>(Main.java:42)
    at main.Main.main(Main.java:23)

"VM Thread" prio=10 tid=0x094fe000 nid=0x534c runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x09489800 nid=0x534a runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0948ac00 nid=0x534b runnable 

"VM Periodic Task Thread" prio=10 tid=0x70207000 nid=0x5353 waiting on condition 

JNI global references: 1234

I think there is some kind of loop regarding the automatic loading of frames. If that is the case, is there any way to disable that behaviour to break the loop?

Thanks in advance!

Answer

Mosty Mostacho · Aug 27, 2011

Well, although it is a horrible solution (a workaround, really), I finally decided to disable the automatic loading of frames in HtmlUnit, as advised by one of the HtmlUnit developers. This is what I did, in detail:

  1. Downloaded the HtmlUnit source
  2. Downloaded Maven
  3. Commented out the body (not the declaration) of the loadFrames() method of the HtmlPage class, located in htmlunit-2.9/src/main/java/com/gargoylesoftware/htmlunit/html
  4. Compiled this custom code, skipping tests, with: mvn -Dmaven.test.skip=true clean compile package
  5. Took the new htmlunit-2.9.jar from htmlunit-2.9/artifacts and used it to replace the htmlunit-2.9.jar library file my application was using
  6. Adapted the application code. This step might be the most delicate one, as it depends on each application, but I'll show you the changes I needed to make to mine.

You already know what my original code looked like (see the question): it downloads every frame and iframe on a page. Here is an example of fetching a page and then loading only the frames you want:

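// Needs HtmlPage and HtmlInlineFrame from com.gargoylesoftware.htmlunit.html
try {
    // With the patched jar, this loads the outer page but leaves its frames empty
    HtmlPage page = webClient.getPage("http://www.w3schools.com/HTML/tryit.asp?filename=tryhtml_noframes");
    // Pick the one iframe we actually care about
    HtmlInlineFrame frame = page.getFirstByXPath("//iframe[@name='view']");
    // Resolve its src against the page URL and load it as a page of its own
    page = webClient.getPage(page.getFullyQualifiedUrl(frame.getSrcAttribute()));
    System.out.println(page.asXml());
} catch (Exception e) {
    e.printStackTrace();
}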

After this library change, the content of the frame is empty when getPage() returns. Note that it is not null; it looks like HtmlUnit just returns an empty frame. So the frames we are interested in have to be downloaded manually, which is why I call getPage() a second time.
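
The second getPage() call needs an absolute URL, which is why the frame's src goes through page.getFullyQualifiedUrl() first. As a rough illustration of what that resolution step amounts to, here is a sketch using only java.net.URI. It is not HtmlUnit's implementation, and the class name, method name and sample src values are made up for the example:

```java
import java.net.URI;

// Toy illustration only: plain java.net.URI, not HtmlUnit code.
final class FrameUrlSketch {

    // Resolves a frame's src attribute against the URL of the page that contains it.
    static String resolve(String pageUrl, String frameSrc) {
        return URI.create(pageUrl).resolve(frameSrc).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://www.w3schools.com/HTML/tryit.asp?filename=tryhtml_noframes";
        // Hypothetical src values, chosen to cover the common cases.
        System.out.println(resolve(pageUrl, "tryit_view.asp"));            // relative to the page's directory
        System.out.println(resolve(pageUrl, "/other/frame.html"));         // relative to the site root
        System.out.println(resolve(pageUrl, "http://example.com/f.html")); // already absolute, left unchanged
    }
}
```

Passing the raw src attribute straight to getPage() would only work in the last case, so resolving it first is the safer habit.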

Well, this is how I managed to selectively download frames and iframes with HtmlUnit. Any ideas on how to improve this would be appreciated. Anyway, I hope a way to disable the loading of frames will be added to HtmlUnit itself in the future, maybe a method such as getPage(URL url, boolean downloadFrames) or something similar.
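
One possible refinement when following frames by hand, assuming the original hang really was a cycle of frames loading each other: remember which URLs have already been loaded and skip them. The sketch below is a toy model in plain Java, not HtmlUnit code. GuardedFrameLoader, loadOnce and framesOf are hypothetical names, and framesOf stands in for "fetch this URL with webClient.getPage() and return the resolved src of each frame you care about":

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model: loads a page and its frames, visiting each URL at most once.
final class GuardedFrameLoader {

    // framesOf is a hypothetical stand-in for fetching a page and listing its frame URLs.
    static List<String> loadOnce(String startUrl, Function<String, List<String>> framesOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(startUrl);
        while (!pending.isEmpty()) {
            String url = pending.pop();
            if (!visited.add(url)) {
                continue; // already loaded: this is where a frame cycle gets cut
            }
            for (String child : framesOf.apply(url)) {
                pending.push(child);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Made-up site: index frames a.php and itself, and a.php frames index again.
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://example.test/index.php",
                Arrays.asList("http://example.test/a.php", "http://example.test/index.php"));
        site.put("http://example.test/a.php",
                Arrays.asList("http://example.test/index.php"));

        List<String> loaded = loadOnce("http://example.test/index.php",
                url -> site.getOrDefault(url, Collections.<String>emptyList()));
        System.out.println(loaded); // each URL appears once even though the frames form a cycle
    }
}
```

Without the visited check, the made-up site above would never finish loading, which is at least consistent with the repeating section of the thread dump in the question.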

Hope this helps someone out there!