I need to analyze a few thousand XML documents to see whether any of them contain a certain construct. The problem is that some of the documents are not well-formed XML.

The basic idea was to use fn:collection() and search inside the nodes it returns. But this only works if all documents in the collection are well-formed.

Is it possible to do something similar while parsing only the well-formed documents?

This is my XSLT, simplified, which works if all documents in $dir are well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>
  <xsl:variable name="dir" as="xs:string">file:/c:/path/to/files/</xsl:variable>
  <xsl:variable name="files" select="concat($dir, '?select=*.xml')" as="xs:string"/>

  <xsl:template match="/">
    <xsl:variable name="docs" select="collection($files)"/>
    <xsl:variable name="names" select="
      for $i in $docs return
        distinct-values($i//*[exists(@an-attribute-to-find)]/local-name())"/>
    <xsl:value-of select="distinct-values($names)" separator="&#x0a;"/>
  </xsl:template>

</xsl:stylesheet>

Would it be possible to do something like this without manually sorting out the documents that are not well-formed before the transformation starts? Or do you have a better suggestion for a solution?

+1  A: 

You could use TagSoup to ensure that all of the documents are well-formed.

If you are using Saxon, you can make TagSoup your parser by adding the following option:

...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

Mads Hansen
TagSoup seems to be based on Saxon 6.5.5, which only handles XSLT 1.0.
Per T
Sorry, now I see that it's also possible to use it with XSLT 2.0, but I still prefer a solution that doesn't depend on other parsing libraries.
Per T
+1  A: 

You could use the doc-available function to tell you if a document is well-formed.

Nick Jones
Yes, but the problem is that since `fn:collection()` collects a set of nodes, it crashes during this function call. Otherwise I could've used `fn:doc-available()` for every document. Or did you have a completely different solution in mind? :)
Per T
+2  A: 

At present this is best done outside of XSLT.

It can be done in XSLT if you pass a list of all filenames to be processed to the transformation as an external parameter (<xsl:param>). The transformation can then call the standard XPath 2.0 function doc-available() on each filename and load, with doc(), only the documents for which it returns true.
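
A minimal sketch of that approach, adapted from the stylesheet in the question. The parameter name `file-list` and its format (a whitespace-separated string of document URIs, with any spaces inside a URI percent-encoded) are assumptions made for illustration; how the list is built and passed in depends on your processor and calling environment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <!-- Assumed (hypothetical) parameter: whitespace-separated document URIs,
       e.g. "file:/c:/path/to/files/a.xml file:/c:/path/to/files/b.xml" -->
  <xsl:param name="file-list" as="xs:string" select="''"/>

  <xsl:template match="/">
    <!-- Split the list into individual URIs; an empty list yields no URIs -->
    <xsl:variable name="uris" select="tokenize(normalize-space($file-list), ' ')"/>
    <!-- doc-available() returns a boolean; doc() is called only when it is true,
         so documents that cannot be parsed are skipped -->
    <xsl:variable name="docs" select="
      for $u in $uris return
        if (doc-available($u)) then doc($u) else ()"/>
    <xsl:variable name="names" select="
      distinct-values($docs//*[exists(@an-attribute-to-find)]/local-name())"/>
    <xsl:value-of select="$names" separator="&#x0a;"/>
  </xsl:template>

</xsl:stylesheet>
```

Like the original stylesheet, this one matches "/", so it still expects some source document as input even though its content is not used.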

Dimitre Novatchev
I'll solve it this way. Then I'm able to test every document with `doc-available()`. Not what I was hoping for but it's good enough.
Per T