I have used this method to retrieve a webpage into an org.jsoup.nodes.Document
object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to an HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information inside a JavaScript (script) element can be lost when the page is parsed. For example, "timestamp" in the source of an Instagram media page.
Use doc.outerHtml().
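import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public void downloadPage() throws Exception {
    // Fetch the page, then parse the response into a Document.
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    final File f = new File("filename.html");
    // Serialize the parsed document and write it out as UTF-8.
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}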
Don't forget to handle exceptions. Add the Apache commons-io library as a dependency (or download the JAR) for a quick and easy way to save files as UTF-8.
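If adding commons-io is not an option, the same write can probably be done with java.nio from the standard library. The following is only a minimal sketch, assuming the HTML string has already been obtained (for example from doc.outerHtml()); the class name HtmlSaver and the method saveHtml are hypothetical names used for illustration, and the sample string stands in for a real page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlSaver {
    // Hypothetical helper: writes the given HTML string to the target path
    // as UTF-8, creating the file or overwriting it if it already exists.
    public static void saveHtml(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; in real use this would be doc.outerHtml().
        String html = "<html><head></head><body><p>hello</p></body></html>";
        Path target = Files.createTempFile("page", ".html");
        saveHtml(html, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

With jsoup on the classpath, the call would look like HtmlSaver.saveHtml(doc.outerHtml(), Paths.get("filename.html")). Encoding the string explicitly as UTF-8 avoids depending on the platform default charset.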