Improve XPath efficiency for repeated, parameterized queries / Xalan selectSingleNode() method

Hi,

I am repeatedly performing the following XPath query (though parameterized by 'keywordText') around 40,000 times:

String query = SystemGlobal.YAHOO_KEYWORDSSUBNODE + "/" + SystemGlobal.YAHOO_KEYWORDNODE + "[" + SystemGlobal.YAHOO_ATTRKEYPHRASE + "='" + keywordText + "']";
CachedXPathAPI cachedXPathAPI = new CachedXPathAPI();
NodeIterator nl = cachedXPathAPI.selectNodeIterator(doc.getElementsByTagName(SystemGlobal.YAHOO_KEYWORDSROOT).item(0), query);

Node n;
if ((n = nl.nextNode()) != null) {
  keyword.setKeywordId(Long.parseLong(cachedXPathAPI.selectSingleNode(n, SystemGlobal.YAHOO_ATTRKEYID).getTextContent()));
  keyword.setKeyPhrase(cachedXPathAPI.selectSingleNode(n, SystemGlobal.YAHOO_ATTRKEYPHRASE).getTextContent());
  keyword.setStatus(mapStatus(cachedXPathAPI.selectSingleNode(n, SystemGlobal.YAHOO_ATTRSTATUS).getTextContent()));
  keyword.setCampaignId(Long.parseLong(cachedXPathAPI.selectSingleNode(n, "../../" + SystemGlobal.YAHOO_ATTRCAMPAIGNID).getTextContent()));
  keyword.setAdGroupId(Long.parseLong(cachedXPathAPI.selectSingleNode(n, "../" + SystemGlobal.YAHOO_ATTRADGROUPID).getTextContent()));
}

On the first run of the script, nl.nextNode() returns null for all 40,000 iterations of this code, and everything runs quite quickly (maybe 1 minute). On subsequent runs, when nl.nextNode() returns a node, things slow down a lot: the run takes around an additional 40 minutes.

Oh, and the doc is constructed like so:

InputSource in = new InputSource(new FileInputStream(filename));
DocumentBuilderFactory dfactory = DocumentBuilderFactory.newInstance();
dfactory.setNamespaceAware(true);
doc = dfactory.newDocumentBuilder().parse(in);

I tried including the following lines:

reportEvaluator = new XPathEvaluatorImpl(reportDoc);
reportResolver = reportEvaluator.createNSResolver(reportDoc);

and, rather than creating a NodeIterator, creating an XPathResult instead:

XPathResult result = (XPathResult)reportEvaluator.evaluate(query, doc.getElementsByTagName(SystemGlobal.YAHOO_KEYWORDSROOT).item(0), reportResolver, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null);

However, this ran even slower.

Is there a way in which I can speed up the running of this script? I have seen references to precompiled queries, though I haven't seen many actual details. Also, as seen in the code, I am using CachedXPathAPI, though the benefit for this case is not so great.
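For reference, a minimal sketch of what a precompiled, parameterized query could look like with the standard javax.xml.xpath API rather than CachedXPathAPI. The class name CompiledKeywordLookup, the variable name $kw, and the sample XML are made up for illustration. Whether this actually improves the timings described above has not been measured. One side benefit is that the keyword is passed as a variable instead of being concatenated into the query string, so a keyword containing an apostrophe does not break the expression.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the original post): compile once, evaluate many times.
class CompiledKeywordLookup {
    private static final QName KW = new QName("kw");
    private final Map<QName, Object> vars = new HashMap<>();
    private final XPathExpression expr;

    // pathWithVariable is expected to reference $kw, e.g. "Adgrp/Kwd[raw=$kw]".
    CompiledKeywordLookup(String pathWithVariable) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // $kw is looked up in this map each time the expression is evaluated.
            xpath.setXPathVariableResolver(vars::get);
            expr = xpath.compile(pathWithVariable);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the first matching node relative to context, or null if there is none.
    Node find(Node context, String keywordText) {
        vars.put(KW, keywordText);
        try {
            return (Node) expr.evaluate(context, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses an XML string the same way the post builds its Document (namespace-aware).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<Keywords><Adgrp><Kwd><raw>SPL1000RMA</raw><kwId>42</kwId></Kwd></Adgrp></Keywords>");
        CompiledKeywordLookup lookup = new CompiledKeywordLookup("Adgrp/Kwd[raw=$kw]");
        Node hit = lookup.find(doc.getDocumentElement(), "SPL1000RMA");
        System.out.println(hit == null ? "no match" : hit.getNodeName());
    }
}
```

The lookup object would be created once, outside the 40,000-iteration loop, and find() called inside it. A single instance is not safe to share between threads, since the variable map is mutated on every call.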

====================================

Message 2, further clarification:

This script is run on two different types of XML files, so examples of the queries would be:

Type1: Adgrp/Kwd[raw="SPL1000RMA"]
Type2: row[@keyword="SPL1000RMA"]

This is the query that is run 40,000 times, each time comparing raw or keyword against a different string.

The time-consuming part comes in the next stage. If this query finds a result, then 5 different calls to selectSingleNode are made (this is using the Xalan library for Java). Examples of the differing queries are as follows (where n is the resulting node from the XPath query described above):

Type1: cachedXPathAPI.selectSingleNode(n, "kwId").getTextContent();

Type2: cachedXPathAPI.selectSingleNode(n, "@keywordid").getNodeValue();

As I see it, the important question here is: once I have the result from the initial query, how can I efficiently access the child node text values (Type1) or attribute values (Type2)? I have seen people say that this approach is slow because many new objects are created. So the question relates directly to how this can be achieved efficiently in Java and Xalan. I believe the actual XPath queries themselves are fine; performing them efficiently is the issue.
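To make the question concrete, here is a rough sketch of the kind of access I mean, reading the values through plain org.w3c.dom calls on the node n instead of going through selectSingleNode. The class and method names (DomFieldAccess, childText, attr) and the sample XML are invented for illustration. It assumes the child elements are unprefixed and are direct children of n. I have not measured whether it is faster for these files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the original post): field access without XPath.
class DomFieldAccess {
    // Text of the first direct child element named childName, or null if absent.
    // Assumes unprefixed element names, so getNodeName() equals the tag name.
    static String childText(Node parent, String childName) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE && childName.equals(c.getNodeName())) {
                return c.getTextContent();
            }
        }
        return null;
    }

    // Attribute value, or null if absent (Element.getAttribute alone returns "" when missing).
    static String attr(Node node, String name) {
        Element e = (Element) node;
        return e.hasAttribute(name) ? e.getAttribute(name) : null;
    }

    // Parses an XML string with a namespace-aware builder, as in the post.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Type1-style data: values are child elements.
        Node kwd = parse("<Adgrp><Kwd><raw>SPL1000RMA</raw><kwId>42</kwId></Kwd></Adgrp>")
            .getElementsByTagName("Kwd").item(0);
        System.out.println(childText(kwd, "kwId"));

        // Type2-style data: values are attributes.
        Node row = parse("<rows><row keyword=\"SPL1000RMA\" keywordid=\"7\"/></rows>")
            .getElementsByTagName("row").item(0);
        System.out.println(attr(row, "keywordid"));
    }
}
```

The "../" and "../../" lookups would presumably become childText(n.getParentNode(), ...) and childText(n.getParentNode().getParentNode(), ...) in the same style.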

If you think I am missing the point then my apologies, and I will answer any questions to help get to the solution. All help is much appreciated!

Chris Allan
