Hi all, I'm just getting started with HtmlUnit, and what I'd like to do is take a web page and extract its raw text, minus all the HTML markup.

Can HtmlUnit do that? If so, how? Or is there another library I should be looking at?

For example, if the page contains

<body><p>para1 test info</p><div><p>more stuff here</p></div>

I'd like it to output

para1 test info more stuff here

Thanks

+1  A: 

The getting-started guide at http://htmlunit.sourceforge.net/gettingStarted.html indicates that this is indeed possible:

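// Imports assume HtmlUnit 2.x package names and JUnit 4.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

@Test
public void homePage() throws Exception {
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    // The page serialized back to markup
    final String pageAsXml = page.asXml();
    assertTrue(pageAsXml.contains("<body class=\"composite\">"));

    // The page's text content with the markup stripped
    final String pageAsText = page.asText();
    assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}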

NB: the page.asText() method seems to offer exactly what you are after.

See the Javadoc for asText (HtmlPage inherits it from DomNode).
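
On the "another library" part of the question: if pulling in HtmlUnit feels heavy for plain tag stripping, one option is the HTML parser that ships with the JDK in javax.swing.text.html. The following is only a minimal sketch, not anything from the HtmlUnit docs. The class name HtmlTextExtractor and the method extractText are made up for illustration, and joining text runs with a single space is an assumption chosen to match the output asked for in the question. It does not run JavaScript or fetch pages; it only parses a string you already have.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of HtmlUnit): strips markup with the JDK's built-in HTML parser.
public class HtmlTextExtractor {

    // Returns the text runs found in the HTML, joined by single spaces.
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // 'true' tells the parser to ignore any charset declared inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(extractText(html));
    }
}
```

The Swing parser targets a fairly old HTML dialect, so for real-world pages page.asText() from the answer above is likely the more robust choice.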

Is there any way to do this with the htmlclient library?
It looks possible (I assume you are referring to Apache HttpClient): http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/examples/TrivialApp.java?view=markup. Note that HttpClient only fetches the raw response, so you would still have to strip the markup yourself.