How to save a Jsoup Document to an HTML file?

Ali Khezeli · Jul 11, 2014 · Viewed 20.3k times

I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:

myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();

How should I write this object to an HTML file? The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.

Some information inside a JavaScript element can be lost when the page is parsed, for example the "timestamp" value in the source of an Instagram media page.

Answer

Gondy · Feb 19, 2015

Use doc.outerHtml().

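import java.io.File;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    // Fetch the page and parse the response into a Document.
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();

    // Write the full document markup to disk as UTF-8.
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}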

Don't forget to handle exceptions. Add the Apache commons-io library as a dependency (or download the jar) for a quick and easy way to save files in UTF-8.
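
If you would rather not add commons-io, the same write can probably be done with the standard library alone. The following is only a sketch: the class name HtmlSaver and the method saveHtml are hypothetical names made up for illustration, and the hard-coded string stands in for the value you would get from doc.outerHtml().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of jsoup or commons-io.
public class HtmlSaver {

    /**
     * Writes the given markup to the target file as UTF-8,
     * creating the file or overwriting an existing one.
     */
    public static void saveHtml(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for doc.outerHtml(); with jsoup on the classpath,
        // pass the parsed document's outerHtml() here instead.
        String html = "<html><head></head><body><p>caf\u00e9</p></body></html>";
        saveHtml(html, Paths.get("filename.html"));
    }
}
```

With jsoup available, the call would look like HtmlSaver.saveHtml(doc.outerHtml(), Paths.get("filename.html")). Encoding the string explicitly as UTF-8 avoids depending on the platform default charset.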