I'm doing some screen scraping using WATIJ, but it can't read HTML tables (throws NullPointerExceptions or UnknownObjectExceptions). To overcome this I read the HTML and run it through JTidy to get well-formed XML.

I want to parse it with XPath, but it can't find a <table ...> by id even though the table is there in the XML plain as day. Here is my code:

XPathFactory factory=XPathFactory.newInstance();  
XPath xPath=factory.newXPath();  
InputSource inputSource = new InputSource(new StringReader(tidyHtml));  
XPathExpression xPathExpression=xPath.compile("//table[@id='searchResult']");  
String expression = "//table[@id='searchResult']";
String table = xPath.evaluate(expression, inputSource);
System.out.println("table = " + table);

The table is an empty String.

The table is in the XML, however. If I print the tidyHtml String it shows

 <table
   class="ApptableDisplayTag"
   id="searchResult"
   style="WIDTH: 99%">

I haven't used XPath before so maybe I'm missing something.

Can anyone set me straight? Thanks.

A: 

I'm not sure, but I think you might have to replace the single quotes around searchResult with double quotes:

String expression = "//table[@id=\"searchResult\"]";

I'm not even sure that's how you would escape the double quotes!

Christian Hagelid
+2  A: 

Sorry, I can't add a comment; I don't have 50 rep.

In response to Christian: no, single quotes are fine. Your escaping is correct, it's just not needed.

Dean: there's some redundant code in your sample, such as the compiled xPathExpression, which is never used. As for an answer, you may need to set the context/namespace.
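To make the namespace suggestion concrete, here is a minimal sketch, assuming the tidied markup declares the XHTML default namespace on its root element and that the parser behind XPath.evaluate(String, InputSource) is namespace-aware. The class name, the SAMPLE string and the prefix "x" are all hypothetical choices for illustration, not part of the original code.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NamespaceXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for tidied output that declares the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
        + "<tr><td>row</td></tr></table>"
        + "</body></html>";

    // Binds the prefix "x" (an arbitrary choice) to the XHTML namespace.
    static NamespaceContext xhtmlContext() {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "x" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("x").iterator()
                    : Collections.<String>emptyIterator();
            }
        };
    }

    // Evaluates the expression to a string, optionally with the "x" prefix bound.
    static String evaluate(String xml, String expression, boolean bindPrefix) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        if (bindPrefix) {
            xPath.setNamespaceContext(xhtmlContext());
        }
        try {
            return xPath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed name test: looks for a table element in no namespace.
        System.out.println("unprefixed: '"
            + evaluate(SAMPLE, "//table[@id='searchResult']/@class", false) + "'");
        // Prefixed name test: looks for a table element in the XHTML namespace.
        System.out.println("prefixed:   '"
            + evaluate(SAMPLE, "//x:table[@id='searchResult']/@class", true) + "'");
    }
}
```

Under those assumptions, the unprefixed expression selects nothing while the prefixed one finds the table. An expression such as //*[local-name()='table'][@id='searchResult'] is another way to match the element without registering a prefix.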

Luke Schafer
A: 

I have never used Java's XPath API directly; I have always used it through dom4j or in other languages (Perl and C), but I have a good understanding of how it normally works. First, you should probably parse the input into a DOM document; this will help greatly. Also, if you know that your document uses IDs, you should parse it with the DTD or Schema that describes it loaded, so that the XML parser will mark and identify the nodes that have proper IDs. Once you have done this you can use your code with the DOM tree.

The documentation of XPath.evaluate(expression, item) shows that the second argument should be a Node or a NodeList. This is probably why you're getting so many UnknownObjectExceptions.

If your XML parser is able to recognize the ID elements then you can access an element having an ID with the following XPath expression:

XPathExpression xPathExpression=xPath.compile("id('searchResult')");
xPathExpression.evaluate(document); // document is a DOM document instance

Using the XPath function id() is the most efficient way to access elements, provided the elements have an ID attribute that has been declared as such in the DTD or Schema.
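The "parse into a DOM first" step could look roughly like the sketch below, which uses the standard javax.xml.parsers and javax.xml.xpath APIs. It deliberately uses the attribute test from the question rather than id(), since id() only resolves when the ID attribute is declared in a DTD or Schema. The class name and the SAMPLE markup (which has no namespace declaration) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomXPathSketch {
    // Hypothetical well-formed markup without a namespace declaration.
    static final String SAMPLE =
        "<html><body>"
        + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
        + "<tr><td>first</td></tr><tr><td>second</td></tr>"
        + "</table></body></html>";

    // Parses the markup into a DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Returns the matching table element, or null when nothing matches.
    static Element findSearchResultTable(Document document) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            return (Element) xPath.evaluate(
                "//table[@id='searchResult']", document, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    // Counts the tr descendants of the given table element.
    static int countRows(Element table) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            NodeList rows = (NodeList) xPath.evaluate(".//tr", table, XPathConstants.NODESET);
            return rows.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Element table = findSearchResultTable(parse(SAMPLE));
        System.out.println("class = " + table.getAttribute("class"));
        System.out.println("rows  = " + countRows(table));
    }
}
```

Asking for XPathConstants.NODE returns the element itself, so its attributes and rows can be inspected, whereas the two-argument evaluate returns only the string value of the first match.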

potyl
A: 

Your XPath is correct. Whatever is failing, it isn't that.

Dr.Dredel
A: 

It looks like the problem is mostly with JTidy. I can get xpath to parse the JTidy-ied result by doing the following:

Remove all "<&amp>nbsp;". JTidy returns XHTML with "<&amp>nbsp;" outside of tags.
Remove the
In the tag remove the xmlns=... attribute.
Remove the "head" tags.

(I use some funny formatting because HTML entities won't display when typed properly.)

JTidy also puts newlines in the middle of the text content of ... elements.
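The cleanup steps that are spelled out above (the non-breaking-space entity, the xmlns attribute and the head section) can be sketched with plain string replacement. This is an assumption-laden illustration, not the code actually used: the class and method names are hypothetical, the entity is replaced with a space rather than deleted, and regex-based editing of markup is brittle compared with fixing the parser configuration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class TidyCleanupSketch {
    // Hypothetical stand-in for the kind of output described above.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Search</title></head>"
        + "<body><p>a&nbsp;b</p>"
        + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
        + "<tr><td>row</td></tr></table>"
        + "</body></html>";

    // Applies the three cleanup steps as text substitutions.
    static String cleanUp(String tidyHtml) {
        return tidyHtml
            // The XML parser does not know the HTML entity, so swap it for a space.
            .replace("&nbsp;", " ")
            // Drop the default namespace declaration so unprefixed names match.
            .replaceAll("\\s+xmlns\\s*=\\s*\"[^\"]*\"", "")
            // Drop the whole head section, including its content.
            .replaceAll("(?s)<head\\b.*?</head>", "");
    }

    // Runs the expression from the question against the cleaned markup.
    static String tableClass(String cleanedXml) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            return xPath.evaluate("//table[@id='searchResult']/@class",
                new InputSource(new StringReader(cleanedXml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String cleaned = cleanUp(SAMPLE);
        System.out.println(cleaned);
        System.out.println("class = " + tableClass(cleaned));
    }
}
```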

I'll have to look at other HTML -> XML conversion options. I gave Cobra a quick try, but it also failed to find my table by Id. I haven't tried manually cleaning up the result from Cobra, so I don't know how it compares to JTidy.

If you know of an HTML parser that returns good XML please let me know.

Dean Schulze
+1  A: 

I don't know anything about JTidy, but for WATIJ, I believe the reason you are getting the NullPointer and UnknownObject exceptions is that your XPath uses lowercase node names. Say you are using "//table[@id='searchResult']" as the XPath to look up the table in WATIJ. That won't actually work because "table" is in lowercase. For WATIJ, you need to have all the node names in uppercase, e.g. "//TABLE[@id='searchResult']". As an example, say you want to print the number of rows of that table using WATIJ; you'd do the following:

import watij.runtime.ie.IE;
import static watij.finders.SymbolFactory.*;

public class Example {
    public static void main(String[] args) {
        IE ie = new IE();
        ie.start("your_url_goes_here");
        System.out.println(ie.table(xpath, "//TABLE[@id='searchResult']").rowCount());
        ie.close();
    }
}

This code or answer may not be right, since I only started using WATIJ today, though I did run into this exact problem with XPaths. It took me a couple of hours of searching and testing before I noticed how all the XPaths were cased on this page: WATIJ User Guide. Once I changed the casing in my XPaths, WATIJ was able to locate the objects, so this should work for you as well.
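For comparison, here is a small sketch of how casing behaves outside WATIJ, with the standard javax.xml.xpath API evaluating well-formed XML. There, name tests are case-sensitive and have to match the casing used in the document, so the uppercase form only helps when the document being queried actually exposes uppercase names. The class name and SAMPLE markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathCaseSketch {
    // Hypothetical markup with lowercase element names and no namespace.
    static final String SAMPLE =
        "<html><body>"
        + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
        + "<tr><td>row</td></tr></table>"
        + "</body></html>";

    // Evaluates the expression to a string; an empty string means no match.
    static String evaluate(String xml, String expression) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            return xPath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Lowercase matches the document's own casing.
        System.out.println("lower: '"
            + evaluate(SAMPLE, "//table[@id='searchResult']/@class") + "'");
        // Uppercase does not match a lowercase element in XML.
        System.out.println("upper: '"
            + evaluate(SAMPLE, "//TABLE[@id='searchResult']/@class") + "'");
    }
}
```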

Michael Cheng
Good observation. The Google WebDriver documents mention that the case sensitivity of xpath statements depends upon which browser you are using.
Dean Schulze
A: 

The solution was to drop WATIJ and switch to Google WebDriver. WebDriver documents how different browsers handle case in xpath statements.

Dean Schulze
A: 

Double quotes are definitely not required, and neither is uppercase. Namespaces and/or DTD are more likely the answer.

EJP
A: 

Unique ID attributes need to be accessed by the id() function: id('search')

Philip