I have an XML file (from the federal government's data.gov) which I'm trying to read with Scala's XML handlers.

val loadnode = scala.xml.XML.loadFile(filename) 

Apparently, there is an invalid XML character in the file. Is there an option to just ignore invalid characters, or is my only option to clean the file up first?

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x12) was found in the element content of the document.

Ruby's Nokogiri was able to parse the file despite the invalid character.

+3  A: 

0x12 is only valid in XML 1.1, and even there only as an escaped character reference, not as a literal character. If your XML file declares that version, you might be able to turn on 1.1 processing support in your SAX parser.

Otherwise, the underlying parser is probably Xerces, which, as a conforming XML parser, is correctly complaining.

If you must handle these streams, I'd write a wrapper InputStream or Reader around my input file, filter out the characters with invalid Unicode values, and pass the rest on.

lavinio
+3  A: 
huynhjl
+6  A: 

To expand on @huynhjl's answer: the InputStream filter is dangerous if you have multi-byte characters, for example in UTF-8 encoded text. Instead, use a character-oriented filter: FilterReader. Or, if the file is small enough, load it into a String and replace the characters there.

scala> val origXml = "<?xml version='1.1'?><root>\u0012</root>"                                          
origXml: java.lang.String = <?xml version='1.1'?><root></root>

scala> val cleanXml = origXml flatMap { 
   case x if Character.isISOControl(x) => "&#x" + Integer.toHexString(x) + ";"
   case x => x.toString
}
cleanXml: String = <?xml version='1.1'?><root>&#x12;</root>

scala> scala.xml.XML.loadString(cleanXml) 
res14: scala.xml.Elem = <root></root>
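
For files too large to hold in a String, the FilterReader route might look roughly like the sketch below. It assumes you want to drop the offending characters rather than escape them, and that the allowed set is the XML 1.0 Char range. XmlCharFilterReader is a hypothetical name, not something provided by scala.xml.

```scala
import java.io.{FilterReader, Reader}

// Hypothetical helper (not part of scala.xml): a character-oriented filter
// that silently drops characters XML 1.0 does not allow.
class XmlCharFilterReader(underlying: Reader) extends FilterReader(underlying) {

  // XML 1.0 Char range as seen per UTF-16 unit: tab, LF, CR,
  // U+0020..U+D7FF and U+E000..U+FFFD. Surrogates are passed through
  // so that supplementary characters survive (pairs are not validated).
  private def isAllowed(c: Char): Boolean =
    c == 0x9 || c == 0xA || c == 0xD ||
      (c >= 0x20 && c <= 0xD7FF) ||
      Character.isSurrogate(c) ||
      (c >= 0xE000 && c <= 0xFFFD)

  // Single-character read: skip ahead until an allowed character or EOF.
  override def read(): Int = {
    var c = super.read()
    while (c != -1 && !isAllowed(c.toChar)) c = super.read()
    c
  }

  // Bulk read: fill the buffer, then compact the allowed characters in place.
  // Loops again if a whole chunk was filtered out, so 0 is never returned
  // for a non-empty request.
  override def read(buf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) 0
    else {
      var result = 0
      while (result == 0) {
        val n = super.read(buf, off, len)
        if (n == -1) result = -1
        else {
          var kept = 0
          for (i <- off until off + n if isAllowed(buf(i))) {
            buf(off + kept) = buf(i)
            kept += 1
          }
          result = kept
        }
      }
      result
    }
  }
}
```

Assuming the file is UTF-8 encoded, it could then be handed to the loader as scala.xml.XML.load(new XmlCharFilterReader(new java.io.InputStreamReader(new java.io.FileInputStream(filename), "UTF-8"))). Because the filtering happens after decoding, multi-byte sequences are not split.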
retronym
Good point on InputStream and multi-byte encoding...
huynhjl