ansaurus

Question

Getting the Text of a webpage with HtmlUnit?

Answer 1

+1 A:

The HtmlUnit getting-started guide (http://htmlunit.sourceforge.net/gettingStarted.html) shows that this is indeed possible:

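import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

@Test
public void homePage() throws Exception {
    // Load the page through HtmlUnit's headless browser.
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    // asXml() returns the page markup.
    final String pageAsXml = page.asXml();
    assertTrue(pageAsXml.contains("<body class=\"composite\">"));

    // asText() returns the page's text without the markup.
    final String pageAsText = page.asText();
    assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}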

NB: the page.asText() method seems to offer exactly what you are after: it returns the text of the page rather than its markup.

See the Javadoc for asText (inherited by HtmlPage from DomNode).

Syntax 2010-07-07 05:15:10

Is there any way to do this with the htmlclient library?

2010-07-07 18:45:05

It looks like it is possible (I assume you are referring to Apache HttpClient). See the TrivialApp example: http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/examples/TrivialApp.java?view=markup

Syntax 2010-07-08 01:57:51
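
A plain HTTP client only returns the raw HTML, so the text still has to be extracted from the markup afterwards. The following is a minimal sketch of that idea, not the approach from the linked TrivialApp example: it swaps in the JDK's own java.net.http.HttpClient (Java 11+) for Apache HttpClient and uses the Swing HTML parser to collect text nodes. The class name HtmlTextExtractor and its two methods are hypothetical names chosen for illustration, and the result will not match HtmlUnit's asText() exactly (no JavaScript execution, simpler whitespace handling).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: fetches a page with the JDK HTTP client and
// extracts its text nodes with the JDK's Swing HTML parser.
final class HtmlTextExtractor {

    // Downloads the raw HTML of the given URL as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Collects the text chunks reported by the parser, trimmed and
    // joined with single spaces; tags themselves are dropped.
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String chunk = new String(data).trim();
                if (chunk.isEmpty()) {
                    return;
                }
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(chunk);
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }
}
```

Usage would look like String text = HtmlTextExtractor.extractText(HtmlTextExtractor.fetch("http://htmlunit.sourceforge.net")); if the page relies on JavaScript to build its content, HtmlUnit remains the better fit.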
