ansaurus

Question

Writing more efficient xquery code (avoiding redundant iteration)

Answer 1

+1 A:

This unfortunately is a shortcoming in XQuery 1.0

XQuery 1.1 adds the group by clause to the syntax to resolve this problem, and your problem would be resolved with:

for $person in /person
let $id = $person/@id
group by $id
return  <people id="{$id}">{
          for $name in distinct-values($person)
          return <name>{$name}</name>
        }</people>

Unfortunately XQuery 1.1 is not widely implemented, so for the moment you are stuck without the group by clause.

As a developer on XQSharp I cannot speak for any other implementations, but we have spent a lot of time tweaking our optimizer to spot common group-by patterns in XQuery 1.1 and perform them with the algorithm you have specified.

In particular, the following version of your query:

declare variable $people as element(person, xs:untyped)* external;

for $id in distinct-values($people/@id)
return <people id="{$id}">{
          for $person in $people
          where $person/@id = $id
          return <name>{$person}</name>
       }</people>

is spotted as a group-by, as is evidenced by the following query plan:

library http://www.w3.org/2005/xpath-functions external;
library http://www.w3.org/2001/XMLSchema external;
declare variable $people external;

for $distinct-person in $people
let $id := http://www.w3.org/2005/xpath-functions:data($distinct-person/attribute::id)
group by
  $id
aggregate
  element {name} { fs:item-sequence-to-node-sequence($distinct-person) }
as
  $:temp:19
return
  element {person} { (attribute {id} { $id } , fs:item-sequence-to-node-sequence($:temp:19)) }

Note that the type annotation as element(person, xs:untyped)* is required, as without knowing that the nodes are untyped (not validated against a schema), the query processor has no way of knowing that $person/@id doesn't have multiple items in its data value. XQSharp does not yet support group by expressions where each node can have more than one key. However in this case a left outer join is still spotted, and so the complexity should be roughly n log n and not quadratic as you are experiencing.

Unfortunately though adding in the distinct-values around the set of people in the group (to filter out duplicate names) seems to stop XQSharp from finding the join; this has been filed as a bug. For now this could be solved by doing the query in two passes - grouping the names by id, and removing duplicate names.

In summary, there is not a better approach in XQuery 1.0, but some implementations (eg. XQSharp) will be able to evaluate this efficiently. If in doubt, check the query plan.

For a more detailed look at the join optimizations performed by XQSharp, take a look at this blog post.

Oliver Hallam 2010-05-13 13:19:20

The performance will be slow because it is quite computation intensive...

vtd-xml-author 2010-05-13 23:43:44

Thanks. A very informative answer. I'm using Saxon, the open-source version of which doesn't include support for 1.1, so I guess using the 'group by' is out of the question. In any case, I'm glad to know my failure to find an efficient solution wasn't caused by a lack of imagination.I'll definitely take a look at XQSharp. Another option I've been toying with is to just write a Python script so I can combine XPath (using a library like xml2) with Python data structures and control flow.

Coquelicot 2010-05-14 08:33:59

Answer 2

A:

If you use an XML database supporting update, such as eXist db, then you can do the grouping just like the Pythonesque code directly into an XML document where presumably the result is needed anyway for later processing.

let $persons := doc("/db/temp/p3.xml")/persons
let $person-groups := doc("/db/temp/p2.xml")/person-groups
for $person in $persons/person
let $name := element name {$person/text()}
let $person-group := $person-groups/person-group[@id=$person/@id]
return
   if ($person-group) 
   then update insert $name into $person-group
   else update insert element person-group {attribute id {$person/@id}, $name} 
       into $person-groups

For my experiments with 10,000 person nodes over 100 distinct ids, eXist on our server has a throughput of about 100 nodes per second.

Note that the update extension to XQuery in eXist are not quite the same syntax as the XQuery Update syntax

Chris Wallace 2010-05-17 23:10:31

Answer 3

A:

Another option: use a map.

let $map := map:map()
let $people :=
  for $person in $all-people
  return map:put($map, $person/@id, 
    (map:get($map, $person/@id), <name>{$person/text()}</name>))
return
  for $id in map:keys($map)
  return 
    <person id="{$id}">{map:get($map, $id)}</person>

Dave Cassel 2010-06-01 12:52:20

Answer 4

+1 A:

Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.

Here is a simple XSLT 2.0 solution (for convenience two of the three documents are represented by <xsl:variable>s):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vDoc2">
  <persons>
   <person id="1">Sir Paul McCartney</person>
   <person id="2">Richard Starkey</person>
  </persons>
 </xsl:variable>

 <xsl:variable name="vDoc3">
  <persons>
   <person id="1">James Paul McCartney</person>
   <person id="2">Richard Starkey - Ringo Starr</person>
  </persons>
 </xsl:variable>

 <xsl:template match="/">
  <xsl:for-each-group group-by="@id" select=
   "(/ | $vDoc2 | $vDoc3)/*/person">

   <person id="{current-grouping-key()}">
     <xsl:for-each select="current-group()">
       <name><xsl:sequence select="text()"/></name>
     </xsl:for-each>
   </person>

  </xsl:for-each-group>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the following XML document:

<persons>
    <person id="1">Paul Mcartney</person>
    <person id="2">Ringo Starr</person>
</persons>

the wanted, correct result is produced:

<person id="1">
   <name>Paul Mcartney</name>
   <name>Sir Paul McCartney</name>
   <name>James Paul McCartney</name>
</person>
<person id="2">
   <name>Ringo Starr</name>
   <name>Richard Starkey</name>
   <name>Richard Starkey - Ringo Starr</name>
</person>

Dimitre Novatchev 2010-09-04 17:09:15

ansaurus

tags:

views:

answers:

Writing more efficient xquery code (avoiding redundant iteration)

related questions