



I'm trying to parse a page with links to articles whose important content looks like this:

<div class="article">
  <h1 style="float: none;"><a href="performing-arts">Performing Arts</a></h1>
  <a href="/performing-arts/">
    <span class="mth3">
      <span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_WctlPremiumContentIcon1">                               
      EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
    <span class="mtp">The EIF&#39;s theatre programme wasn&#39;t as far-reaching as it could have been, but did find an exoticism in the familiar,  writes Mark Fisher </span>

Here is a minimal scraping case in Java using HtmlUnit and XPath (imports removed for brevity):

public class MinimalTest {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        System.out.println("Fetching front page");
        HtmlPage frontPage = client.getPage("");
        List<ArticleInfo> articleInfos = extractArticleInfo(frontPage);

        for (ArticleInfo info : articleInfos) {
            System.out.println("Title: " + info.getTitle());
            System.out.println("Intro: " + info.getFirstPara());
            System.out.println("Link: " + info.getLink());
        }
    }

    @SuppressWarnings("unchecked") // xpath returns List<?>
    private static List<ArticleInfo> extractArticleInfo(HtmlPage frontPage) {
        System.out.println("Extracting article links");
        List<HtmlDivision> articleDivs = (List<HtmlDivision>) frontPage.getByXPath("//div[@class='article']");
        System.out.println(String.format("Found %d articles", articleDivs.size()));
        List<ArticleInfo> articleLinks = new ArrayList<ArticleInfo>(articleDivs.size());
        // Build one ArticleInfo per article div
        for (HtmlDivision div : articleDivs) {
            articleLinks.add(ArticleInfo.constructFromArticleDiv(div));
        }
        return articleLinks;
    }

    private static class ArticleInfo {
        private final String title;
        private final String link;
        private final String firstPara;

        public ArticleInfo(final String link, final String title, final String firstPara) {
            this.link = link;
            this.title = title;
            this.firstPara = firstPara;
        }

        // Pull the link, title and intro out of a single article div
        public static ArticleInfo constructFromArticleDiv(final HtmlDivision div) {
            String link = ((DomText) div.getFirstByXPath("//a/@href/text()")).asText();
            String title = ((DomText) div.getFirstByXPath("//span[@class='mth3']/text()")).asText();
            String firstPara = ((DomText) div.getFirstByXPath("//span[@class='mtp']/text()")).asText();
            return new ArticleInfo(link, title, firstPara);
        }

        public String getTitle() {
            return title;
        }

        public String getFirstPara() {
            return firstPara;
        }

        public String getLink() {
            return link;
        }
    }
}

Output I expect:

Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus 
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher 

What I get:

Fetching front page
Extracting article links
Found 24 articles
Exception in thread "main" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at com.intellij.rt.execution.application.AppMain.main(

Calling getByXPath works fine on an HtmlPage but seems to return nothing on any other HtmlElement. What's wrong? Is this a bug or an implementation gap in HtmlUnit, or am I missing something subtle about XPath syntax?

Related question whose solution didn't work for me:

A:

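You've tried to treat an attribute as an element. `//a/@href/text()` asks for the text-node children of the `href` attribute, but an attribute node has no children, so getFirstByXPath returns null and calling asText() on it throws the NullPointerException. Select the attribute itself and read its value instead: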

String link = ((DomAttr) div.getFirstByXPath("//a/@href")).getValue();

With that change I got:

Fetching front page
Extracting article links
Found 24 articles
Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher
Link: /Register.aspx?
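
If you want to see the difference between the two expressions without HtmlUnit, here is a minimal sketch using the XPath engine that ships with the JDK (`javax.xml.xpath`). The markup is a simplified, well-formed stand-in for your article div, and the class and method names are made up for the example. I'm assuming HtmlUnit follows the same XPath 1.0 rules here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttributeXPathDemo {
    // Simplified stand-in for the article markup (an assumption, not the real page).
    static final String XML =
        "<div class='article'><a href='/performing-arts/'>"
      + "<span class='mth3'>Title</span></a></div>";

    // Returns the first node matching the expression, or null if nothing matches.
    public static Node first(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // An attribute node has no text-node children, so this matches nothing.
        System.out.println(first("//a/@href/text()"));          // null
        // Selecting the attribute itself works; its value is the URL.
        System.out.println(first("//a/@href").getNodeValue());  // /performing-arts/
    }
}
```

The null in the first case is the same kind of result that the cast-and-asText() chain in the question then trips over.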

Also, your ArticleInfo class declares "link" to be a String, then assigns it some (custom?) class. I had to mangle things a bit just to get it to compile.
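
One more thing that may be worth checking, judging by the `Link: /Register.aspx?` line in my output: in XPath 1.0 an expression that starts with `//` is evaluated from the root of the document, even when you call it on a div. To search only inside the div you would write `.//a/@href`. I haven't verified this against your page, so treat the following as a sketch; it uses the JDK's own XPath engine, and the two-link markup (including the navigation link) is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {
    // Hypothetical page: a navigation link first, then one article div.
    static final String XML =
        "<body>"
      + "<div class='nav'><a href='/Register.aspx'>Register</a></div>"
      + "<div class='article'><a href='/performing-arts/'>Performing Arts</a></div>"
      + "</body>";

    // Evaluates the expression with the article div as the context node
    // and returns the value of the first attribute it selects.
    public static String hrefFromArticleDiv(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node articleDiv = (Node) xpath.evaluate(
                "//div[@class='article']", doc, XPathConstants.NODE);
        Node attr = (Node) xpath.evaluate(expression, articleDiv, XPathConstants.NODE);
        return attr.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        // Leading "//" restarts at the document root: first link on the page.
        System.out.println(hrefFromArticleDiv("//a/@href"));   // /Register.aspx
        // Leading ".//" stays inside the context node: the article's own link.
        System.out.println(hrefFromArticleDiv(".//a/@href"));  // /performing-arts/
    }
}
```

If that is what is happening on your page, the same `.` prefix would apply to the `mth3` and `mtp` expressions too, otherwise every article would report the first title and intro in the document.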

Rodney Gitzel
`Link` was a container class holding two strings, one for the clickable text displayed and the other for the URL of the linked resource. Sorry, I should have factored it out, but I was a little rushed when I wrote this! I have now amended the code above.
@Rodney Gitzel: +1 for catching that syntax error (`//a/@href/text()`)