I have a bunch of documents in a MarkLogic XML database. One document has:

<colors>
  <color>red</color>
  <color>red</color>
</colors>

Having multiple colors is not a problem. Having multiple colors that are both red is a problem. How do I find the documents that have duplicate data?

A: 

This should do the trick. I am not too familiar with MarkLogic, so the first line, which gets the set of documents, may be wrong. The query returns every document that has two or more color elements with the same string value.

for $doc in doc()
let $colors := $doc//color/string(.)
(: keep the document if any color value occurs more than once :)
where some $color in $colors
      satisfies count($colors[. = $color]) > 1
return $doc
Oliver Hallam
Is iterating over all of the documents the only way to go?
Sixty4Bit
A: 

For this XML:

<?xml version="1.0"?>
<colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
</colors>

Using this XSLT stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method = "text" />  
    <xsl:strip-space elements="*"/>

    <xsl:template match="colors">

     <xsl:for-each select="color">
      <xsl:variable name="node_color" select="text()"/>
      <xsl:variable name="numEntries" select="count(../color[text()=$node_color])"/>
      <xsl:if test="$numEntries &gt; 1">
       <xsl:text>Color value of </xsl:text><xsl:value-of select="."/><xsl:text> has multiple entries &#xa;</xsl:text>  
      </xsl:if>
     </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

I got this output:

Color value of Red has multiple entries 
Color value of Red has multiple entries

So that will at least find them, but it reports every occurrence of a repeated color rather than reporting each repeated color once.
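
If you only want each repeated color reported once, one option is to tally the values first and then print only those whose tally is above one. Here is a minimal sketch of that idea in Python rather than XSLT, using only the standard library. The `repeated_colors` helper and the `SAMPLE` string are names made up for illustration, and this runs outside MarkLogic.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Same sample data as the XML above (illustrative copy).
SAMPLE = """<colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
</colors>"""

def repeated_colors(xml_text):
    """Return each color value that occurs more than once, listed once."""
    root = ET.fromstring(xml_text)
    # Tally the text of every <color> element in the document.
    counts = Counter((c.text or "").strip() for c in root.iter("color"))
    return [value for value, n in counts.items() if n > 1]

for value in repeated_colors(SAMPLE):
    print(f"Color value of {value} has multiple entries")
```

With the sample above this prints a single line for Red instead of two.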

Stephen Friederichs
+1  A: 

Everything MarkLogic returns is just a sequence of nodes, so we can compare the number of color elements with the number of distinct values among them. If the two counts differ, at least one value is duplicated, and that gives you your subset.

for $c in doc()//colors
where fn:count($c/color) != fn:count(fn:distinct-values($c/color))
return $c
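
To sanity-check the count versus distinct-count comparison outside the database, here is a rough Python sketch of the same logic using only the standard library. The in-memory `docs` mapping, its URIs, and both function names are hypothetical stand-ins for illustration; nothing here is a MarkLogic API.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the database: URI -> document text.
docs = {
    "/docs/a.xml": "<colors><color>red</color><color>red</color></colors>",
    "/docs/b.xml": "<colors><color>red</color><color>blue</color></colors>",
}

def has_duplicate_colors(colors_elem):
    """True when the element has fewer distinct color values than color children."""
    values = [c.text or "" for c in colors_elem.findall("color")]
    return len(values) != len(set(values))

def documents_with_duplicates(documents):
    """Return the URIs of documents containing a <colors> element with repeats."""
    flagged = []
    for uri, text in documents.items():
        root = ET.fromstring(text)
        # iter() also visits the root, so <colors> may sit at any depth.
        if any(has_duplicate_colors(c) for c in root.iter("colors")):
            flagged.append(uri)
    return flagged

print(documents_with_duplicates(docs))
```

Only the first document is flagged, since it is the only one where the number of color elements differs from the number of distinct values.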
jtsnake