Here's an XSLT 2.0 method.
Assuming that $docs
contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group>
to do that:
<xsl:for-each-group select="$docs//*" group-by="name()">
<xsl:sort select="current-group-key()" />
<xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
<xsl:value-of select="$name" />
...
</xsl:for-each-group>
Then you want to find out the stats for that element amongst the documents. First, find the documents have an element of that name in them:
<xsl:variable name="docs-with" as="document-node()+"
select="$docs[//*[name() = $name]" />
Second, you need a sequence of the number of elements of that name in each of the documents:
<xsl:variable name="elem-counts" as="xs:integer+"
select="$docs-with/count(//*[name() = $name])" />
And now you can do the calculations. Average, minimum and maximum can be calculated with the avg()
, min()
and max()
functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.
Putting that together:
<xsl:for-each-group select="$docs//*" group-by="name()">
<xsl:sort select="current-group-key()" />
<xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
<xsl:variable name="docs-with" as="document-node()+"
select="$docs[//*[name() = $name]" />
<xsl:variable name="elem-counts" as="xs:integer+"
select="$docs-with/count(//*[name() = $name])" />
<xsl:value-of select="$name" />
<xsl:text>* </xsl:text>
<xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
<xsl:text>%</xsl:text>
<xsl:text>
</xsl:text>
</xsl:for-each-group>
What I haven't done here is indented the lines according to the depth of the element. I've just ordered the elements alphabetically to give you statistics. Two reasons for that: first, it's significantly harder (like too involved to write here) to display the element statistics in some kind of structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known (because, for example, sections can nest within sections to any depth).
I hope it's useful none the less.
UPDATE:
Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.
You'll need to put all the files you want to analyse in a directory. Saxon allows you to access all the files in that directory (or its subdirectories) using a collection using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files that you're looking at by their filename.
Now the full XSLT:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs">
<xsl:param name="dir" as="xs:string"
select="'file:///path/to/default/directory?select=*.xml'" />
<xsl:output method="text" />
<xsl:variable name="docs" as="document-node()*"
select="collection($dir)" />
<xsl:template name="main">
<xsl:for-each-group select="$docs//*" group-by="name()">
<xsl:sort select="current-group-key()" />
<xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
<xsl:variable name="docs-with" as="document-node()+"
select="$docs[//*[name() = $name]" />
<xsl:variable name="elem-counts" as="xs:integer+"
select="$docs-with/count(//*[name() = $name])" />
<xsl:value-of select="$name" />
<xsl:text>* </xsl:text>
<xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
<xsl:text> </xsl:text>
<xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
<xsl:text>%</xsl:text>
<xsl:text>
</xsl:text>
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
And to run it you would do something like:
> java -jar path/to/saxon.jar -it:main -o:report.txt dir=file:///path/to/your/directory?select=*.xml
This tells Saxon to start the process with the template named main
, to set the dir
parameter to file:///path/to/your/directory?select=*.xml
and send the output to report.txt
.