tags:

views:

125

answers:

4

Here's a simplified version of a problem I'm working on: I have a bunch of xml data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find

<person id=1>Paul Mcartney</person>
<person id=2>Ringo Starr</person>

And in another I might find:

<person id=1>Sir Paul McCartney</person>
<person id=2>Richard Starkey</person>

I want to use xquery to produce a new document that lists every name associated with a given id. i.e.:

<person id=1>
    <name>Paul McCartney</name>
    <name>Sir Paul McCartney</name>
    <name>James Paul McCartney</name>
</person>
<person id=2>
    ...
</person>

The way I'm doing this now in xquery is something like this (pseudocode-esque):

let $ids := distinct-terms( [all the id attributes on people] )
for $id in $ids
    return <person id={$id}>
    {
    for $unique-name in distinct-values
            (
            for $name in ( [all names] )
            where $name/@id=$id
            return $name
            )
        return <name>{$unique-name}</name>
    }
    </person>

The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 xml files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set java's maximum heap size to 10 GB (!) to avoid getting out of memory errors, and it's currently using 6 GB of physical memory.

So here's how I'd really like to do this (in Pythonic pseudocode):

persons = {}
for id in ids:
    person[id] = set()
for person in all_the_people_in_my_xml_document:
    persons[person.id].add(person.name)

There, I just did it in linear time, with only one sweep of the xml document. Now, is there some way to do something similar in xquery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, xquery doesn't (as far as I know) have anything like an associative array.

Is there some clever way around this? Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.

+1  A: 

This unfortunately is a shortcoming in XQuery 1.0

XQuery 1.1 adds the group by clause to the syntax to resolve this problem, and your problem would be resolved with:

for $person in /person
let $id = $person/@id
group by $id
return  <people id="{$id}">{
          for $name in distinct-values($person)
          return <name>{$name}</name>
        }</people>

Unfortunately XQuery 1.1 is not widely implemented, so for the moment you are stuck without the group by clause.

As a developer on XQSharp I cannot speak for any other implementations, but we have spent a lot of time tweaking our optimizer to spot common group-by patterns in XQuery 1.1 and perform them with the algorithm you have specified.

In particular, the following version of your query:

declare variable $people as element(person, xs:untyped)* external;

for $id in distinct-values($people/@id)
return <people id="{$id}">{
          for $person in $people
          where $person/@id = $id
          return <name>{$person}</name>
       }</people>

is spotted as a group-by, as is evidenced by the following query plan:

library http://www.w3.org/2005/xpath-functions external;
library http://www.w3.org/2001/XMLSchema external;
declare variable $people external;

for $distinct-person in $people
let $id := http://www.w3.org/2005/xpath-functions:data($distinct-person/attribute::id)
group by
  $id
aggregate
  element {name} { fs:item-sequence-to-node-sequence($distinct-person) }
as
  $:temp:19
return
  element {person} { (attribute {id} { $id } , fs:item-sequence-to-node-sequence($:temp:19)) }

Note that the type annotation as element(person, xs:untyped)* is required, as without knowing that the nodes are untyped (not validated against a schema), the query processor has no way of knowing that $person/@id doesn't have multiple items in its data value. XQSharp does not yet support group by expressions where each node can have more than one key. However in this case a left outer join is still spotted, and so the complexity should be roughly n log n and not quadratic as you are experiencing.

Unfortunately though adding in the distinct-values around the set of people in the group (to filter out duplicate names) seems to stop XQSharp from finding the join; this has been filed as a bug. For now this could be solved by doing the query in two passes - grouping the names by id, and removing duplicate names.

In summary, there is not a better approach in XQuery 1.0, but some implementations (eg. XQSharp) will be able to evaluate this efficiently. If in doubt, check the query plan.

For a more detailed look at the join optimizations performed by XQSharp, take a look at this blog post.

Oliver Hallam
The performance will be slow because it is quite computation intensive...
vtd-xml-author
Thanks. A very informative answer. I'm using Saxon, the open-source version of which doesn't include support for 1.1, so I guess using the 'group by' is out of the question. In any case, I'm glad to know my failure to find an efficient solution wasn't caused by a lack of imagination.I'll definitely take a look at XQSharp. Another option I've been toying with is to just write a Python script so I can combine XPath (using a library like xml2) with Python data structures and control flow.
Coquelicot
A: 

If you use an XML database supporting update, such as eXist db, then you can do the grouping just like the Pythonesque code directly into an XML document where presumably the result is needed anyway for later processing.

let $persons := doc("/db/temp/p3.xml")/persons
let $person-groups := doc("/db/temp/p2.xml")/person-groups
for $person in $persons/person
let $name := element name {$person/text()}
let $person-group := $person-groups/person-group[@id=$person/@id]
return
   if ($person-group) 
   then update insert $name into $person-group
   else update insert element person-group {attribute id {$person/@id}, $name} 
       into $person-groups

For my experiments with 10,000 person nodes over 100 distinct ids, eXist on our server has a throughput of about 100 nodes per second.

Note that the update extension to XQuery in eXist are not quite the same syntax as the XQuery Update syntax

Chris Wallace
A: 

Another option: use a map.

let $map := map:map()
let $people :=
  for $person in $all-people
  return map:put($map, $person/@id, 
    (map:get($map, $person/@id), <name>{$person/text()}</name>))
return
  for $id in map:keys($map)
  return 
    <person id="{$id}">{map:get($map, $id)}</person>
Dave Cassel
+1  A: 

Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.

Here is a simple XSLT 2.0 solution (for convenience two of the three documents are represented by <xsl:variable>s):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vDoc2">
  <persons>
   <person id="1">Sir Paul McCartney</person>
   <person id="2">Richard Starkey</person>
  </persons>
 </xsl:variable>

 <xsl:variable name="vDoc3">
  <persons>
   <person id="1">James Paul McCartney</person>
   <person id="2">Richard Starkey - Ringo Starr</person>
  </persons>
 </xsl:variable>

 <xsl:template match="/">
  <xsl:for-each-group group-by="@id" select=
   "(/ | $vDoc2 | $vDoc3)/*/person">

   <person id="{current-grouping-key()}">
     <xsl:for-each select="current-group()">
       <name><xsl:sequence select="text()"/></name>
     </xsl:for-each>
   </person>

  </xsl:for-each-group>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the following XML document:

<persons>
    <person id="1">Paul Mcartney</person>
    <person id="2">Ringo Starr</person>
</persons>

the wanted, correct result is produced:

<person id="1">
   <name>Paul Mcartney</name>
   <name>Sir Paul McCartney</name>
   <name>James Paul McCartney</name>
</person>
<person id="2">
   <name>Ringo Starr</name>
   <name>Richard Starkey</name>
   <name>Richard Starkey - Ringo Starr</name>
</person>
Dimitre Novatchev