I'm encoding-challenged, so this is probably simple, but I'm stuck.

I'm trying to parse an XML file emailed to App Engine's new receive-mail functionality. At first I just pasted the XML into the body of the message, and it parsed fine with cElementTree. Then I switched to sending it as an attachment, and parsing that with cElementTree produces this error:

SyntaxError: not well-formed (invalid token): line 3, column 10

I've output the XML from both emailing in the body and as an attachment, and they look the same to me. I assume pasting it in the box is changing the encoding in a way that attaching the file is not, but I don't know how to fix it.

The first few lines look like this:

<?xml version="1.0" standalone="yes"?>
<gpx xmlns="http://www.topografix.com/GPX/1/0" version="1.0" creator="TopoFusion 2.85" xmlns:TopoFusion="http://www.TopoFusion.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd http://www.TopoFusion.com http://www.TopoFusion.com/topofusion.xsd">
 <name><![CDATA[Pacific Crest Trail section K hike 4]]></name><desc><![CDATA[Pacific Crest Trail section K hike 4.  Five Lakes to Old Highway 40 near Donner.  As described in Day Hikes on the PCT California edition by George & Patricia Semb. See pages 150-152 for access and exit trailheads. GPS data provided by the USFS]]></desc><author><![CDATA[MikeOnTheTrail]]></author><email><![CDATA[[email protected]]]></email><url><![CDATA[http://www.pcta.org]]></url>
 <urlname><![CDATA[Pacific Crest Trail Association Homepage]]></urlname>
 <time>2006-07-08T02:16:05Z</time>

Edited to add more info:

I have a GPX file that's a few thousand lines. If I paste it into the body of the message I can parse it correctly, like so:

 gpxcontent = message.bodies(content_type='text/plain')
 for x in gpxcontent:
   gpxcontent = x[1].decode()
 for event, elem in ET.iterparse(StringIO.StringIO(gpxcontent), events=("start", "start-ns")):

If I instead attach the file to the mail using Gmail and then extract it like so:

 if isinstance(message.attachments, tuple):
     attachments = [message.attachments]
     gpxcontent = attachments[0][3].decode()
     for event, elem in ET.iterparse(StringIO.StringIO(gpxcontent), events=("start", "start-ns")):

I get the error above. Line 3, column 10 seems to be the start of ![CDATA on the third line.
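When two copies of a document look identical by eye, comparing them programmatically can show where they actually diverge. The sketch below is only an illustration: `first_difference` and `show_difference` are hypothetical helper names, and it assumes both versions have already been decoded to strings.

```python
def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def show_difference(a, b, context=20):
    """Describe the first mismatch, using repr() so case and whitespace are visible."""
    i = first_difference(a, b)
    if i is None:
        return "identical"
    return "differs at index %d: %r vs %r" % (i, a[i:i + context], b[i:i + context])
```

Calling `show_difference(body_text, attachment_text)` on the two decoded strings (both variable names are placeholders) would point at the first character that changed, including a change in letter case.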

A: 

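Ah, never mind. There's a bug in App Engine that calls lower() on every attachment when you decode it. That turns <![CDATA[ into <![cdata[, which is not well-formed XML because the CDATA keyword is case-sensitive, hence the "invalid token" error on line 3.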

Here's a link to the bug report: http://code.google.com/p/googleappengine/issues/detail?id=2289#c2
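The effect is easy to reproduce outside App Engine. This is a minimal sketch, not the App Engine code path: `SAMPLE` is a made-up three-line document shaped like the GPX header in the question, `try_parse` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for cElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the first lines of the GPX file.
SAMPLE = (
    '<?xml version="1.0" standalone="yes"?>\n'
    '<gpx version="1.0">\n'
    ' <name><![CDATA[Pacific Crest Trail section K hike 4]]></name>\n'
    '</gpx>'
)


def try_parse(xml_text):
    """Return the <name> text, or a short description of the parse error."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return "ParseError: %s" % err
    return root.find("name").text


original_result = try_parse(SAMPLE)
# Simulate the reported bug by lowercasing the whole payload before parsing.
lowered_result = try_parse(SAMPLE.lower())
```

The untouched text parses and yields the name, while the lowercased copy fails because `<![cdata[` is not a recognized token.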

smokey_the_bear