I'm using the COBRA HTMLParser but haven't had luck parsing one particular tag. Here's the source:

<li id="eta" class="hentry">
  <span class="body">
    <span class="actions">
    </span>
    <span class="content">
    </span>
    <span class="meta entry">Content here
    </span>
    <span class="meta entry stub">Content here
    <span class="shared-content">
      Information by
      <a class="title" data="associate" href="/associate">Associate</a>
    </span>
    </span>
  </span>
</li>

I am able to use the following XPaths to get the proper information:

    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList nodeList = (NodeList) xpath.evaluate("//span[contains(@class, 'body')]", document, XPathConstants.NODESET);
    int length = nodeList.getLength();
    System.out.println(nodeList.getLength());
    for (int i = 0; i < length; i++) {
        Element element = (Element) nodeList.item(i);
        NodeList n = null;
        try {
            n = (NodeList) xpath.evaluate("span[contains(@class, 'content')]", element, XPathConstants.NODESET);
            String body = n.item(0).getTextContent();
            System.out.println("Content: " + body);
        } catch (Exception e) {};

        try {
            String date = (String) xpath.evaluate("span[contains(@class, 'meta entry')]/a/span/@data", element, XPathConstants.STRING);
            System.out.println("DATA: " + date);

            String source = (String) xpath.evaluate("//span[contains(@class, 'meta entry')]/span", element, XPathConstants.STRING);
            System.out.println("DATA: " + source);
        } catch (Exception e) {};

        // This does not work at all! I've tried every combination and still can't get it to run
        try {
            String info = (String) xpath.evaluate("//span[@class='shared-content']/a/@data", element, XPathConstants.STRING);
            System.out.println("INFO: " + info);
        } catch (Exception e) {};
    }

The last expression does not work, whatever combination I try. I've also tried the following, but neither helps:

    String info = (String) xpath.evaluate("//span[contains(@class, 'shared-content')]/a/@data", element, XPathConstants.STRING);
    String info = (String) xpath.evaluate("//span[contains(@class, 'meta entry info')]/span/a/@data", element, XPathConstants.STRING);

Any suggestions?

EDIT: There have been a couple of suggestions that the XML is illegal. Honestly, I'm not sure why it would be, because I've seen this pattern almost everywhere. In any case, I don't have control over the XML (at least not until Monday, when my colleagues get back). I am trying to assess the feasibility of writing a mashup that includes this information. Is there some way to disable the checking?

Here's the XML that was parsed:

       <?xml version="1.0" encoding="UTF-8"?>
          <span class="body">
            <span class="content">TextContent</span>
            <span class="meta entry">TextContent</span>

          </span>

I guess the document is not getting parsed correctly.
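For anyone who wants to produce a dump like the one above, here is a minimal sketch that serializes a DOM with the JDK's identity `Transformer`. The class name `DomDumpDemo` and its helper methods are made up for illustration, and the sketch parses a small string with the built-in XML parser rather than with Cobra. It should work on any `org.w3c.dom` tree, but I have not verified it against a Cobra-built document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomDumpDemo {
    // Hypothetical helper: parse a well-formed XML string with the built-in parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: serialize any DOM node back to markup,
    // so you can see exactly what the parser put into the tree.
    static String dump(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<span class=\"body\"><span class=\"content\">TextContent</span></span>");
        System.out.println(dump(doc));
    }
}
```

If an element you expect is missing from the dump, the problem lies in how the DOM was built, not in the XPath expression.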

+1  A: 

@Jherico, @Andrew Keith: I don't know the COBRA HTMLParser, but mixing #PCDATA with child elements (mixed content) is legal XML.
It could be declared like this in a DTD:

<!ELEMENT text_node     (#PCDATA|i|b|u)*>

This is why well-formed HTML can still be legal XML.
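To illustrate, here is a small sketch that validates a mixed-content document against that declaration using the JDK's built-in DOM parser. The class name `MixedContentDemo`, the sample text, and the extra declarations for `i`, `b` and `u` are my own assumptions, added only so that the document is complete enough to validate.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class MixedContentDemo {
    // Sample document (assumed for illustration): text interleaved with child elements.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE text_node [\n"
        + "  <!ELEMENT text_node (#PCDATA|i|b|u)*>\n"
        + "  <!ELEMENT i (#PCDATA)>\n"
        + "  <!ELEMENT b (#PCDATA)>\n"
        + "  <!ELEMENT u (#PCDATA)>\n"
        + "]>\n"
        + "<text_node>Hello <b>bold</b> and <i>italic</i> text</text_node>";

    // Parses with DTD validation switched on and returns the concatenated text.
    // Any validity or well-formedness error is rethrown instead of being swallowed.
    static String parseAndGetText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* warnings ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAndGetText(XML));
    }
}
```

The parse completes without a validation error, which is the point: text mixed with inline elements is valid as long as the content model allows it.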

jutky
A: 

I ran the following code:

public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, XPathExpressionException {
    Document doc = XmlUtil.parseXmlResource("/temp.xml");
    for (Node n : XPathUtil.getNodes(doc, "//span[contains(@class, 'body')]")) {
        System.out.println(XPathUtil.getStringValue(doc, "//span[@class='shared-content']/a/@data"));
    }
}

And it output 'associate'. I think your XPath is fine. What is happening instead? And can you remove the empty catch blocks so we can see if you're actually getting exceptions?

Note, XmlUtil and XPathUtil are my own personal convenience functions to eliminate most of the XPath and XML boilerplate code.
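For readers without those helpers, here is a rough equivalent using only the JDK's `javax.xml` classes. The class name `BuiltInXPathDemo` and its methods are made up for this sketch, and the markup is a trimmed copy of the snippet from the question. It is not the code I ran above, just the same idea without the wrappers.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BuiltInXPathDemo {
    // Trimmed copy of the question's markup; it is well-formed, so an XML parser accepts it.
    static final String MARKUP =
          "<li id=\"eta\" class=\"hentry\">"
        + "<span class=\"body\">"
        + "<span class=\"meta entry stub\">Content here"
        + "<span class=\"shared-content\">Information by "
        + "<a class=\"title\" data=\"associate\" href=\"/associate\">Associate</a>"
        + "</span></span></span></li>";

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Finds the first "body" span, then reads the data attribute beneath it.
    static String sharedContentData(String xml) throws Exception {
        Document doc = parse(xml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList bodies = (NodeList) xpath.evaluate(
            "//span[contains(@class, 'body')]", doc, XPathConstants.NODESET);
        Element body = (Element) bodies.item(0);
        // ".//" searches only the descendants of 'body'; a leading "//" would
        // start from the document root even though 'body' is the context node.
        return (String) xpath.evaluate(
            ".//span[@class='shared-content']/a/@data", body, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("INFO: " + sharedContentData(MARKUP)); // prints "INFO: associate"
    }
}
```

Note that `XPathConstants.STRING` yields an empty string rather than an exception when nothing matches, so a blank result usually means the node is not in the DOM that was actually built.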

Jherico
Thanks. I wonder why it's not working here, though. No exceptions are being thrown at all, which makes me wonder where it is going wrong. All it gives me is a blank string. Which library are you using, by the way?
Legend
The built-in Java 5 XML and XPath libraries.
Jherico
So I'll try to dump Cobra and use the built-in ones... Do you know of any better libraries?
Legend
Built-in parsers will parse XML, not HTML (they will parse XHTML, since that is an XML dialect, but not any random HTML).
Pavel Minaev
+1  A: 

XPathVisualizer is a nice tool for Windows that lets you see the results of your XPath queries. It is free and installs by xcopy: a single EXE file.

I took it and ran your query in it, got this result:

[image: XPathVisualizer query result]

Cheeso
A: 

I just ran your code sample as is (copy and paste) and got the output below, so everything seems fine. (Which Cobra version are you using? I'm on 0.98.4.)

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate


Reproducible test(?)

  • Using javac/java version 1.6.0_16 (HotSpot Client: build 14.2-b01, mixed mode, sharing)
  • Downloaded 0.98.4 (cobra-0.98.4.zip) from the SourceForge Cobra HTML Toolkit download page
  • Extracted js.jar and cobra.jar from the lib directory of cobra-0.98.4.zip to a directory XXX
  • Wrote XMLTest.java and HTMLTest.java in the same directory (the filenames are links to the source)
  • Ran this to compile (Windows): javac -cp .;cobra.jar;js.jar *.java
  • Then executed like this (output included)

XMLTest

java -cp .;cobra.jar;js.jar XMLTest 1

XMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate

HTMLTest

java -cp .;cobra.jar;js.jar HTMLTest 1

HTMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate
jitter
I am using the latest one from the official page, which is 0.98.4. That is so strange. I just updated my post to say that the parser was not parsing the entire DOM. Are you using the same HTML parser provided by Cobra? I mean, how did you construct the DOM?
Legend
Check the expanded answer. I've provided the source too (tested with both HTML and XML parsing).
jitter