I'm trying to parse XML using this class. When I type out a simple file by hand, it works fine:

<testData>
    <text>
     odp
    </text>
</testData>

Here is my main method:

public static void main(String[] args) {
    Xml train = new Xml(args[0], "trainingData");
    Xml test = new Xml(args[1], "testData");
}

However, when I use a file whose contents I copied and pasted from Microsoft Office OneNote, I get this error:

Exception in thread "main" java.lang.RuntimeException: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at odp.compling.Xml.rootElement(Xml.java:41)
    at odp.compling.Xml.<init>(Xml.java:61)
    at odp.compling.ParseTreeAnalysis2.main(ParseTreeAnalysis2.java:10)
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at odp.compling.Xml.rootElement(Xml.java:33)
    ... 2 more

What is causing this? I opened the problematic XML file in Notepad++ and changed the encoding to UTF-8. That turned the accented characters and special quotation marks into a bunch of weird characters, which I edited out. Am I not converting properly?

(I don't know anything about text encoding formats, in case you couldn't tell.)

+1  A: 

Your file is not properly encoded as UTF-8, but your parser is expecting UTF-8.

It would help to pinpoint the problem if you could post a hex dump of the file.
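
If posting a full dump is awkward, a strict decode can at least tell you where the file stops being valid UTF-8. The following is only a minimal sketch: `Utf8Check` and `firstInvalidOffset` are hypothetical names, not part of the `Xml` class from the question, and it assumes the file is small enough to read into memory.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the question's code.
class Utf8Check {

    // Returns the offset of the first byte sequence that is not valid UTF-8,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On an error, the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the XML file to check.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(bytes);
        if (offset < 0) {
            System.out.println("File is valid UTF-8");
        } else {
            System.out.printf("First invalid byte 0x%02x at offset %d%n",
                    bytes[offset] & 0xff, offset);
        }
    }
}
```

A byte such as 0x93 or 0xE9 at the reported offset would suggest the text was saved in a single-byte Windows code page rather than UTF-8, though that is a guess until you see the actual bytes.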

ZZ Coder
How can I generate such a hex dump?
Rosarch
On Unix/Linux/Mac, use "od -x file". On Windows, you have to download a tool, like this one: http://www.richpasco.org/utilities/hexdump.html
ZZ Coder
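
Since Java is already in use here, a hex dump can also be produced without any extra tool. This is a rough sketch rather than a replacement for the utilities above: `HexDump` and `dump` are made-up names, and the output layout (an 8-digit hex offset followed by up to 16 bytes per row) is just one reasonable choice. It prints one byte at a time, whereas `od -x` groups bytes into 16-bit words.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
class HexDump {

    // Formats the bytes as rows of up to 16, each prefixed with its hex offset.
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%08x:", i));
            }
            // Mask to 0..255 so bytes above 0x7f are not printed sign-extended.
            sb.append(String.format(" %02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the file to dump.
        System.out.println(dump(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Run it as `java HexDump yourfile.xml` after compiling, and post the rows around the spot where the parser fails.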