ansaurus

Question

XPATH based content extraction from html pages

Answer 1

A:

I don't understand what you want to achieve and how it relates to XPath. If you want to map XML to Java objects then JAXB might help, but it is based on XML schemas, not on XPath.

Martin Honnen 2010-07-29 16:07:44

The use case is scraping different web pages. The content in the HTML cannot be extracted by XML binders, since it is not XML we are dealing with. Apologies if I have given that impression. Besides, I don't want to write Java code for every new page we scrape; it should be as simple as writing XPaths in a config for the different contents and getting the data.

Nayn 2010-07-30 05:42:19

Answer 2

+3 A:

I am not sure I got your question, but it sounds like you want to use XPath on HTML documents.

To use XPath, the HTML document being parsed needs to be well-formed. There are several HTML parsers for Java; this article compares 4 of them.

HtmlCleaner seems to provide what you are after. It allows a subset of XPaths to be performed on "cleaned-up" HTML documents. Apparently it doesn't support the full set of XPath expressions though, see the documentation.

If you require more complex XPath expressions than what HtmlCleaner supports, you may need to use the javax.xml.xpath package with a well-formed XHTML document. JTidy can convert an HTML document to an XHTML one.
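As a minimal sketch of the javax.xml.xpath route, the snippet below evaluates an XPath against an already well-formed XHTML string using only the JDK. The class name, method name and sample markup are hypothetical and chosen for illustration; it assumes the input has already been cleaned up (for example by JTidy) and carries no default namespace or DOCTYPE, both of which would need extra handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlXPathSketch {

    // Parses well-formed XHTML and returns the trimmed text of every node the expression matches.
    static List<String> extractText(String xhtml, String expression) {
        try {
            // The factory is not namespace-aware by default, so elements match by plain name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input or an invalid expression ends up here.
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, already well-formed.
        String xhtml = "<html><body>"
                + "<div class='divclass'><p>first</p><p>second</p></div>"
                + "<div class='other'><p>ignored</p></div>"
                + "</body></html>";
        System.out.println(extractText(xhtml, "//div[@class='divclass']/p"));
    }
}
```

Because the expression is passed in as a plain string, it could just as well be read from a config file, which is closer to what the question asks for.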

I hope this answers your question.

William 2010-08-21 13:30:13

Please see my EDITs

Nayn 2010-08-25 14:38:00

Does "html:div[@class='divclass']/item/*" give what you want?

William 2010-08-25 23:49:02

It would list all the nodes, whereas I might want to fetch only a few. Also, the hierarchy could be any number of levels deep.

Nayn 2010-08-26 08:35:54

Answer 3

+1 A:

Why not apply XPath in two steps?

First, one or more XPaths to get the records (the lines in your output):

//div[@class='divclass']/item

Then the XPath(s) to get the fields (the columns), relative to each record:

item_name
item_qty
item_price

Here's working code (in JavaScript, Windows scripting) that gives you the output you want:

var doc = new ActiveXObject("MSXML.DOMDocument");
doc.load("test.xml");

// XPATH #1
var recordXPath = "//div[@class='divclass']/item";
// XPATHS #2, in a dictionary ("field name":"XPath")
var fieldXPaths = { item_name : "item_name",
                    item_qty : "item_qty",
                    item_price : "item_price" };

var items = doc.selectNodes(recordXPath);
for (var itemCtr = 0; itemCtr < items.length; itemCtr++) {
    var item = items[itemCtr];
    var fieldEntries = [];

    for (var fieldName in fieldXPaths) {
        var fieldXPath = fieldXPaths[fieldName];
        var fieldNode = item.selectSingleNode(fieldXPath);
        fieldEntries.push(fieldNode.tagName + ":" + fieldNode.text);
    }
    WScript.Echo(fieldEntries.join(";"));
}
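
Since the question is about Java, the same two-step idea can be sketched with the JDK's javax.xml.xpath, which accepts any node as the evaluation context. This is an illustrative sketch, not a tested scraper: the class name, method name and sample markup are hypothetical, and the input is assumed to be well-formed already.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TwoStepXPathSketch {

    // Step 1 selects the record nodes; step 2 evaluates each field XPath relative to a record.
    static List<String> extractRecords(String xml, String recordXPath, Map<String, String> fieldXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList records = (NodeList) xpath.evaluate(recordXPath, doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                List<String> entries = new ArrayList<>();
                for (Map.Entry<String, String> field : fieldXPaths.entrySet()) {
                    // The record node is the context; a missing field yields an empty string.
                    String value = xpath.evaluate(field.getValue(), record);
                    entries.add(field.getKey() + ":" + value.trim());
                }
                lines.add(String.join(";", entries));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input mirroring the element names used in this answer.
        String xml = "<html><body><div class='divclass'>"
                + "<item><item_name>apple</item_name><item_qty>3</item_qty><item_price>1.50</item_price></item>"
                + "<item><item_name>pear</item_name><item_qty>7</item_qty><item_price>0.80</item_price></item>"
                + "</div></body></html>";

        // Field name -> XPath relative to each record; this map could be loaded from a config file.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("item_name", "item_name");
        fields.put("item_qty", "item_qty");
        fields.put("item_price", "item_price");

        for (String line : extractRecords(xml, "//div[@class='divclass']/item", fields)) {
            System.out.println(line);
        }
    }
}
```

If a field can sit at any depth below the record, a relative descendant path such as .//item_name keeps the search inside that record, whereas a path starting with // always searches from the document root.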

Jerome 2010-08-25 17:12:59

Could you give me an example of nested XPaths and their application? I thought XPaths are applied to the whole document, returning a Document object, and then we have to walk through the node list to get the content, attributes, attribute values, etc.

Nayn 2010-08-26 05:57:17

XPaths can be applied to the document or relative to a node. See the edit with code above.

Jerome 2010-08-26 07:45:43

Answer 4

+2 A:

I think XQuery is a great solution for screen scraping. You can use the Saxon processor to execute your XQueries. Moreover, you can use the Piggy Bank Firefox extension to find the XPath expressions for the content you want to extract from a web page, and then use them within your XQueries.

jaxvy 2010-08-25 18:20:46

Agreed, I've used XQuery to that effect in the past, and it works great. One will need to couple it with an HTML to DOM parser like Tidy if the parsed HTML comes "from the wild".

Tassos Bassoukos 2010-08-26 07:55:43

Answer 5

A:

I don't know if this helps, but I use XSLT to go the other way, from data to HTML. It seems to me that you just need to structure the XPath execution a little, and XSLT is good for this.

MikeAinOz 2010-08-25 22:40:36
