views:

2819

answers:

2

I'm building a web service and have a node that accepts a POST to create a new resource. The resource expects one of two content-types - an XML format I'll be defining, or form-encoded variables.

The idea is that consuming applications can POST XML directly and benefit from better validation etc., but there's also an HTML interface that will POST the form-encoded stuff. Obviously the XML format has a charset declaration, but I can't see how I detect the form's charset just from looking at the POST.

A typical post to the form from Firefox looks like this:

POST /path HTTP/1.1
Host: www.myhostname.com
User-Agent: Mozilla/5.0 [...etc...]
Accept: text/html,application/xhtml+xml, [...etc...]
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

field1=value1&field2=value2&field3=value3

Which doesn't seem to contain any useful indication of the character set.

From what I can see, the application/x-www-form-urlencoded type is entirely defined in HTML, which just lays out the %-encoding rules, but doesn't say anything about what charset the data should be in.

Basically, is there any way of telling the character set if I don't know the character set the HTML originally presented was? Otherwise I'll have to try and guess the character set based on what chars are present, and that's always a bit iffy from what I can tell.

+1  A: 

The Charset used in the POST will match that of the Charset specified in the HTML hosting the form. Hence if your form is sent using UTF-8 encoding that is the encoding used for the posted content. The URL encoding is applied after the values are converted to the set of octets for the character encoding.

AnthonyWJones
I was more wondering if there was a stateless way of approaching it, as in without knowledge of the form's character set.
Ciaran McNulty
No. The client would have to explicitally declare the charset in the HTTP headers for that to work.
Remy Lebeau - TeamB
+4  A: 

the default encoding of a HTTP POST is ISO-8859-1.

else you have to look at the Content-Type header that will then look like

Content-Type: application/x-www-form-urlencoded ; charset=UTF-8

You can maybe declare your form with

<form enctype="application/x-www-form-urlencoded;charset=UTF-8">

or

<form accept-charset="UTF-8">

to force the encoding.

Some references :

http://www.htmlhelp.com/reference/html40/forms/form.html

http://www.w3schools.com/tags/tag_form.asp

chburd
Does this work with common browsers?
Ciaran McNulty
well I don't know, I'm not a Web developper, i've added links where you can find some references.
chburd