I want the entire HTML for a given HtmlPage object.

What property should I use?

+1  A: 

The quickest way to do this is HtmlPage.asXml(). It may not exactly match what you would see if you did "View Source" in a normal browser, but I've found it very helpful for developing and debugging HtmlUnit code.

MatrixFrog
Yep, asXml() actually reconstructs the document from the parsed page, so you won't get the HTML as it was passed on the wire -- which makes things either nicer (if you want a tidied version of the doc) or harder (if you're looking for the original HTML).
delfuego
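To see why a reconstructed document differs from the text on the wire, here is a rough JDK-only analogy. It does not use HtmlUnit at all: the class name, the helper reserialize, and the sample markup are made up for illustration, and the standard XML parser stands in for HtmlUnit's HTML parser. It parses a string into a DOM and writes it back out, which is roughly the kind of round trip asXml() performs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of HtmlUnit.
public class RawVersusReserialized {

    // Parse well-formed markup into a DOM, then serialize the DOM back to text.
    static String reserialize(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML output and skip the <?xml ...?> header to keep the comparison simple.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample "wire" text with odd spacing and single-quoted attributes.
        String raw = "<html><body><p   class='x'>hi</p></body></html>";
        String rebuilt = reserialize(raw);

        System.out.println("raw:     " + raw);
        System.out.println("rebuilt: " + rebuilt);
        System.out.println("identical? " + raw.equals(rebuilt));
    }
}
```

The rebuilt string carries the same content, but details such as attribute quoting and spacing inside tags come from the serializer rather than from the original author. That is the trade-off described above: tidier output, but not the bytes the server sent.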
+3  A: 

In HtmlUnit, HtmlPage implements the Page interface, so you can call Page#getWebResponse() to get the web response the HtmlPage was built from, and then WebResponse#getContentAsString() to get the response body as text. Here's a method that does what you want:

// Fetches the URL and returns the response body as text, as the server sent it.
public String getRawPageText(WebClient client, String url)
        throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    HtmlPage page = client.getPage(url);
    return page.getWebResponse().getContentAsString();
}

Or, using an HtmlPage object that you've already fetched:

// Returns the original response body behind an already-loaded page.
public String getRawPageText(HtmlPage page) {
    return page.getWebResponse().getContentAsString();
}
delfuego
Since mrblah specifically mentioned an HtmlPage object, I would simply make the page itself an argument, instead of passing in a WebClient and a URL. But the essential idea is absolutely correct.
MatrixFrog
Totally -- I just added that.
delfuego