Hey guys,

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.

I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:

<?xml version="1.0" encoding="utf-8"?> 
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
    <ListDomainsResult>
        <DomainName>Audio</DomainName>
        <DomainName>Course</DomainName>
        <DomainName>DocumentContents</DomainName>
        <DomainName>LectureSet</DomainName>
        <DomainName>MetaData</DomainName>
        <DomainName>Professors</DomainName>
        <DomainName>Tag</DomainName>
    </ListDomainsResult>
    <ResponseMetadata>
        <RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
        <BoxUsage>0.0000071759</BoxUsage>
    </ResponseMetadata>
</ListDomainsResponse>

I pass in this XML to a parser with

XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());

and call eventReader.nextEvent(); a bunch of times to get the data I want.
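For reference, the parsing step described above could look roughly like the following. This is only a minimal sketch: the class and method names (DomainNameReader, readDomainNames) are made up for illustration, and it reads from a String through a StringReader rather than from response.getContent(), so it is not the AWS SDK's own unmarshalling code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class DomainNameReader {
    // Hypothetical helper: collects the text of every <DomainName> element.
    public static List<String> readDomainNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Match on the local name so the default namespace does not matter.
                if (event.isStartElement()
                        && "DomainName".equals(event.asStartElement().getName().getLocalPart())) {
                    // Valid here because the last event returned was a START_ELEMENT.
                    names.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

Anything other than whitespace in front of the XML declaration should make this fail with an XMLStreamException, which is the same family of error as the one reported below.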

Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:

com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?> 
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
    at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
    ... (rest of lines omitted)

I have double-, triple-, and quadruple-checked this XML for 'invisible characters', non-UTF-8 characters, etc. I looked at it byte by byte in an array for byte-order marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it also happens if I use a Saxon-based parser -- but ONLY on GAE; it always works fine in my local environment.
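One way to do that kind of byte-level check on the deployed side, where no debugger is available, is to log the first few bytes of the response as hex. The helper below is a hypothetical sketch (PrologInspector and hexPrefix are invented names), and it assumes the response body has already been read into a byte array.

```java
public class PrologInspector {
    // Hypothetical helper: renders the first `count` bytes as space-separated hex pairs.
    public static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int limit = Math.min(count, data.length);
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of bytes >= 0x80.
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

A clean document starts with 3C 3F 78 6D 6C (the bytes of "<?xml"). A leading EF BB BF would be a UTF-8 byte-order mark, and FE FF or FF FE a UTF-16 one.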

It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:

  • XML with and without the prolog
  • With and without newlines
  • With and without the "encoding=" attribute in the prolog
  • Both newline styles
  • With and without the chunking information present in the HTTP stream

And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before and can shed some light on it?

Thanks!

+2  A: 

The encodings in your XML and XSD (or DTD) may be different, for example:
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>

Another possible scenario that causes this is when anything comes before the XML declaration, i.e., you might have something like this in the buffer:

helloworld<?xml version="1.0" encoding="utf-8"?>  

or even a space or special character.

There are some special characters called byte order marks (BOMs) that could be in the buffer. Before passing the buffer to the parser, do this...

String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
Romain Hippeau
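The regex in the answer above works on a String. If the content is only available as an InputStream, a similar check could be done on the raw bytes before the parser sees them. The following is a rough sketch under stated assumptions: BomSkipper and skipUtf8Bom are hypothetical names, only the UTF-8 BOM (EF BB BF) is handled, and nothing here is specific to the AWS SDK or GAE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSkipper {
    // Hypothetical helper: returns a stream positioned after a leading UTF-8 BOM, if one is present.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && read > 0) {
            // No BOM: put the bytes back so the parser sees the full document.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}
```

The returned stream could then be handed to xmlInputFactory.createXMLEventReader(...) in place of the original one. Whether a BOM is really what triggers the error on GAE is not established in this thread.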
Hi Romain, thanks for the response! I've double and triple checked many times for anything in the buffer prior to the prolog (including hidden characters) but there simply isn't anything else there. I'll give switching to utf-16 encoding a try, however -- out of curiosity, where did you get the information that the XSD uses UTF-16?
Adrian Petrescu
@Adrian Petrescu Sorry, these are just examples. If you are using DTDs or XSDs, make sure their encodings match your XML's. Before you parse the XML, capture it in a String, surround it with '|', and print it to the console. This will tell you if you are passing in some extra characters.
Romain Hippeau
@Romain Hippeau Ah, I see :) Unfortunately I tried it and it doesn't appear to be the case in this situation. Thanks anyway!
Adrian Petrescu
@Adrian Petrescu I updated my post for you to try something else. Change your XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent()); to ... String xml = response.getContent(); xml = xml.trim().replaceFirst("^([\\W]+)<","<"); XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(new StringReader(xml)); (this assumes the content is available as a String; if getContent() returns a stream, read it into a String first).
Romain Hippeau
@Romain Thanks, I'll give this a try soon, even though I previously already checked for byte-order marks; maybe they're being introduced somewhere between the input stream and the XMLReader.
Adrian Petrescu