Why is Twitter double-encoding XML entity references?

Here's an example tweet:

xml entity ref test < & '

The response from statuses/friends_timeline:

<status>
  <created_at>Wed Jun 24 00:16:15 +0000 2009</created_at>
  <id>2302770346</id>
  <text>xml entity ref test &amp;lt; &amp; '</text>
  <source>web</source>
  <truncated>false</truncated>

Shouldn't it be the following?

&lt; &amp; &apos;

I did some more testing. Here's what happens in the HTTP POST that sends the update:

sniff again < & '

POST data:

authenticity_token=secret_sauce_removed&status=sniff+again+%3C+%26+'&twttr=true&return_rendered_status=true
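To double-check what the server actually receives, the POST body above can be URL-decoded. This is only a minimal sketch using Python's standard library (the variable names are mine, not part of any Twitter API); it suggests the status arrives as raw text with no entity encoding applied by the client:

```python
from urllib.parse import parse_qs

# The POST body captured above, copied verbatim.
post_data = (
    "authenticity_token=secret_sauce_removed"
    "&status=sniff+again+%3C+%26+'"
    "&twttr=true&return_rendered_status=true"
)

# parse_qs splits on '&', turns '+' into spaces and decodes %XX escapes.
fields = parse_qs(post_data)
status = fields["status"][0]

print(status)  # sniff again < & '
```

So any `&lt;` or `&amp;` seen later is added on the server side, not by the form submission.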

I've confirmed Justin's observation that only < and > are double-encoded. The first line is the XML response, the second is JSON:

 <text>&quot; &amp; ' &amp;lt; &amp;gt;</text>
"text":"\" & ' &lt; &gt;"

The Twitter documentation says "escaped and HTML encoded status body". I guess "escaped" means XML-encoding < and >.
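One way to compare the two lines is to parse each with its own format's parser and look at what is left. The sketch below uses only the Python standard library, and wraps the JSON fragment in braces so it is a complete object (that wrapping is my assumption about the surrounding response):

```python
import html
import json
import xml.etree.ElementTree as ET

# The two response fragments from above.
xml_response = """<text>&quot; &amp; ' &amp;lt; &amp;gt;</text>"""
json_response = r'''{"text":"\" & ' &lt; &gt;"}'''

# Undo only the transport-level encoding of each format.
from_xml = ET.fromstring(xml_response).text
from_json = json.loads(json_response)["text"]

print(from_xml)               # " & ' &lt; &gt;
print(from_xml == from_json)  # True

# One further HTML-unescape step recovers the text as typed.
print(html.unescape(from_xml))  # " & ' < >
```

Both formats appear to carry the same payload, in which only < and > are HTML-escaped. The extra `&amp;` in the XML is just the XML layer escaping the ampersands of that payload.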

But I still don't understand why they're doing it. No web pages are involved in the whole process: the tweet is sent URL-encoded through the REST API, and it is retrieved as XML or JSON.

A: 

It looks like Twitter is taking the HTML-encoded text and putting it inside an XML field, so that when you run your XML parser on the response, you get valid HTML.

FryGuy
Andrew Medico
Justin Niessner
+1  A: 

It's double-encoded because the text property is quasi-HTML-encoded text (it looks like they only encode < and > so that you can't start or end an HTML element in your tweet). Therefore, before the XML serializer encodes it for transmission over the wire, you'd have:

xml entity ref test &lt; & '

That string then gets encoded again (so that when it is decoded, it is still the proper HTML-encoded text), which turns it into:

xml entity ref test &amp;lt; &amp; '

That is the string you are getting back.
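The two steps described here can be reproduced with a rough sketch. This is a guess at the behaviour, not Twitter's actual code: `escape_angle_brackets` is a hypothetical helper standing in for the first stage, and `xml.sax.saxutils.escape` stands in for whatever XML serializer is used for the second.

```python
import html
from xml.sax.saxutils import escape, unescape

def escape_angle_brackets(text):
    """Hypothetical stage 1: HTML-escape only < and >."""
    return text.replace("<", "&lt;").replace(">", "&gt;")

tweet = "xml entity ref test < & '"

# Stage 1: the quasi-HTML-encoded status body.
stage1 = escape_angle_brackets(tweet)
print(stage1)  # xml entity ref test &lt; & '

# Stage 2: XML-escape for the wire (escapes &, < and >, not ').
stage2 = escape(stage1)
print(stage2)  # xml entity ref test &amp;lt; &amp; '

# A client reverses the steps in the opposite order.
after_xml = unescape(stage2)         # what an XML parser hands back
original = html.unescape(after_xml)  # the tweet as typed
print(original == tweet)  # True
```

In practice the first decoding step is done by your XML parser, so the only extra work on the client is a single HTML-unescape of the text node.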

Justin Niessner
