I thought values entered in forms are properly encoded by browsers.

But this simple test file "test_get_vs_encodeuri.html" shows it's not true:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
   <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
   <title></title>
</head><body>

<form id="test" action="test_get_vs_encodeuri.html" method="GET" onsubmit="alert(encodeURIComponent(this.one.value));">
   <input name="one" type="text" value="Euro-€">
   <input type="submit" value="SUBMIT">
</form>

</body></html>

When I hit the submit button:

encodeURIComponent encodes the input value as "Euro-%E2%82%AC"

while the browser writes just "Euro-%80" into the GET query.

  1. Could someone explain?

  2. How do I encode everything the same way the browser's FORM does (windows-1252) using JavaScript? (The escape function does not work, and encodeURIComponent does not work either.)

Or is encodeURIComponent doing unnecessary conversions?

A: 

I think the root of the problem is character encodings. If I mess around with charset in the meta tag and save the file with different encodings I can get the page to render in the browser like this:

[Image: Content encoding issue]

That € looks a lot like what you're getting from encodeURIComponent. However, I could find no combination of encodings that made any difference to what encodeURIComponent returned. I can make a difference to what the GET query returns. With your original page, submitting gives a URL like:

test-get-vs-encodeuri.html?one=Euro-%80

This is a UTF-8 version of the page; submitting gives a URL that looks like this (in Firefox):

http://www.boogdesign.com/examples/encode/test-get-vs-encodeuri-utf8.html?one=Euro-€

But if I copy and paste it I get:

http://www.boogdesign.com/examples/encode/test-get-vs-encodeuri-utf8.html?one=Euro-%E2%82%AC

So it looks like if the page is UTF-8 then the GET and encodeURIComponent match.
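
One way to see why the two forms are not interchangeable is to decode them again. The snippet below is only a sketch to paste into a browser console or Node; the variable names are illustrative. It shows that the UTF-8 form round-trips through decodeURIComponent, while the windows-1252 form is rejected because %80 on its own is not valid UTF-8.

```javascript
// The UTF-8 form of the query value decodes back to the original text.
const decoded = decodeURIComponent("Euro-%E2%82%AC");
console.log(decoded === "Euro-\u20AC"); // true

// The windows-1252 form is not valid UTF-8, so decoding it throws a URIError.
let decodeError = null;
try {
  decodeURIComponent("Euro-%80");
} catch (e) {
  decodeError = e;
}
console.log(decodeError instanceof URIError); // true
```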

robertc
encodeURIComponent always assumes UTF-8. From http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf, section 15.1.3.4 encodeURIComponent (uriComponent): "The encodeURIComponent function computes a new version of a URI in which each instance of certain characters is replaced by one, two or three escape sequences representing the UTF-8 encoding of the character."
Mike Samuel
A: 

This is a character encoding issue. Your document uses the charset Windows-1252, where the € is at position 128 and is encoded as the single byte 0x80. But encodeURIComponent always encodes to UTF-8, which is based on Unicode's character set, where the € is at position 8364 (U+20AC) and is encoded in UTF-8 as the three bytes 0xE2 0x82 0xAC.
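
The Unicode side of this can be checked in a browser console or Node. This is only a small sketch with illustrative variable names. It also shows why the legacy escape function does not match the form either: for code points above 0xFF it emits a non-standard %uXXXX sequence.

```javascript
// The euro sign as a JavaScript string (written as an escape so the
// result does not depend on how this file itself is saved).
const euro = "\u20AC";

// Unicode code point: 8364 decimal, 0x20AC hex.
const codePoint = euro.charCodeAt(0);

// encodeURIComponent percent-encodes the UTF-8 bytes of the character.
const viaEncodeURIComponent = encodeURIComponent("Euro-" + euro);

// The legacy escape() uses %uXXXX for code points above 0xFF.
const viaEscape = escape("Euro-" + euro);

console.log(codePoint.toString(16)); // 20ac
console.log(viaEncodeURIComponent);  // Euro-%E2%82%AC
console.log(viaEscape);              // Euro-%u20AC
```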

A solution would be to use UTF-8 for your document as well. Or you could write your own mapping that converts the characters to their Windows-1252 bytes before percent-encoding them.
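
A rough sketch of such a mapping follows. The names CP1252_HIGH and encodeCp1252Component are made up for illustration, and the function only approximates what a browser does for a windows-1252 form. In particular, it throws on characters that Windows-1252 cannot represent, which is an assumption of this sketch and not what a real browser does.

```javascript
// Unicode code points that Windows-1252 places in the byte range 0x80-0x9F.
const CP1252_HIGH = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
  0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
  0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
  0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
  0x017E: 0x9E, 0x0178: 0x9F
};

// Hypothetical helper: percent-encode a string roughly the way a
// windows-1252 form field would appear in a GET query.
function encodeCp1252Component(str) {
  let out = "";
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    let byte;
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
      byte = cp; // ASCII and the Latin-1 range: byte value equals code point
    } else if (CP1252_HIGH[cp] !== undefined) {
      byte = CP1252_HIGH[cp];
    } else {
      // Assumption of this sketch: reject anything outside Windows-1252.
      throw new RangeError("Not representable in windows-1252: " + ch);
    }
    if (/^[A-Za-z0-9*\-._]$/.test(ch)) {
      out += ch;  // left as-is in form encoding
    } else if (ch === " ") {
      out += "+"; // forms encode a space as a plus sign
    } else {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(encodeCp1252Component("Euro-\u20AC")); // Euro-%80
```

With this, the value from the test form comes out as "Euro-%80", matching the GET query of the windows-1252 page. Switching the page to UTF-8 remains the simpler fix, because then encodeURIComponent and the form agree on the bytes.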

Gumbo
@Gumbo: thanks, I understand now. But this brings me back to another question I already asked: what is this damn encodeURIComponent useful for? I mean, the value encoded by the FORM cannot be wrong even if I use cp1252, so why should I use this damn encodeURIComponent to encode a URI? Can't I just use the simple JS escape function, which returns values identical to the ones encoded by the FORM? I know it might not be nice, but in the end I prefer to encode things exactly like a browser's FORM would. http://stackoverflow.com/questions/2238515/encodeuricomponent-is-really-useful
Marco Demajo
@Marco Demaio: As `encodeURIComponent` always encodes to UTF-8, you will need to write your own encoding function.
Gumbo