When I type a URL like http://www.example.com/?query=Траливали into the Firefox address bar, it is automatically encoded to http://www.example.com/?query=%D2%F0%E0%EB%E8%E2%E0%EB%E8.

But a URL like http://www.example.com/#ajax_call?query=Траливали is not converted.

Other browsers, such as IE8, do not convert the query at all.

The question is: how can I detect (in PHP) whether the query is encoded, and how do I decode it?

I've tried:

  1. $str = iconv('cp1251', 'utf-8', urldecode($str) );

  2. $str = utf8_decode(urldecode($str));

  3. $str = (urldecode($str));

  4. many functions from http://php.net/manual/en/function.urldecode.php

Nothing works.

Test:

$str = $_GET['str'];

d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == urldecode('%D2%F0%E0%EB%E8%E2%E0%EB%E8'));

d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == $str);

d('Траливали' == $str);

d(urldecode($str));

d(utf8_decode(urldecode($str)));

!!! d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == urlencode($str)); !!!

Returns:

[false] [false] [false] ��������� ???? [true]

A workaround of sorts: send the query as part of the URL path, as in http://www.example.com/Траливали/, and parse it with mod_rewrite.

+1  A: 
rawurldecode($_GET['query']);

but this should actually have been done already by php ;-)

Edit: you're stating "nothing works" - what are you trying? If the text doesn't appear on screen as you want it when you echo $_GET['query'], for example, your problem might be the encoding you are specifying for the page sent back to the browser.

Include a line

header("Content-Type: text/html; charset=utf-8");

and see if it helps.

mvds
There is such a header (of course).
topright
please show the entire script then and show us what exactly fails.
mvds
I added some tests in the post.
topright
A: 

URLs are limited to certain ASCII characters. Characters that are not URL-friendly are supposed to be URL-encoded (the %hh encoding you see). Some browsers automatically encode URLs that are typed into the address bar.

seand
-1: There is no problem with passing UTF-8 in a query. Each byte of a multibyte character is simply percent-encoded (two bytes per Cyrillic letter in UTF-8), and the bytes are then decoded properly.
Andrew Moore
But the browser is still encoding the url behind the scenes. The server should see a well-formed url which the webapp will be able to decode.
seand
@seand: The browser does not need to understand the charset to URL encode. It simply reads each byte (8 bits) and transforms it into a hexadecimal value. Any character not considered `printable ascii` is encoded by the user-agent per RFC3986.
Andrew Moore
+5  A: 

It is not converted because placing the query part of the URL after the fragment is not valid.

RFC 3986 defines a URI as composed of the following parts:

     foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment

The order cannot be changed. Therefore,

URL1: http://www.example.com/?query=Траливали#ajax_call

will be handled properly while

URL2: http://www.example.com/#ajax_call?query=Траливали

will not. If we look at URL2, IE actually handles the URL properly by detecting the fragment as #ajax_call?query=Траливали, without a query. The fragment always comes last and is never sent to the server.

IE will properly encode the query component of URL1 as it will detect it as a query.

As for decoding in PHP, %D2 and similar sequences are automatically decoded in the $_GET['query'] variable. The $_GET variable was not properly populated because, in URL2, there is no query according to the standard.
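
To see where PHP itself places the query in each ordering, here is a minimal sketch using parse_url() and parse_str(). The value `test` stands in for the Cyrillic text and the variable names are only illustrative.

```php
<?php
// URL1: query before fragment (the ordering RFC 3986 requires).
$url1 = 'http://www.example.com/?query=test#ajax_call';
// URL2: the "query" follows '#', so it is part of the fragment.
$url2 = 'http://www.example.com/#ajax_call?query=test';

$parts1 = parse_url($url1);
$parts2 = parse_url($url2);

// URL1 has a real query component that can be parsed into variables.
parse_str($parts1['query'], $vars1);
var_dump($vars1['query']);
var_dump($parts1['fragment']);

// URL2 has no query component; everything after '#' is the fragment.
var_dump(isset($parts2['query']));
var_dump($parts2['fragment']);
```

In a real request the browser drops the fragment before sending, so for URL2 the server would see only the path `/`.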

Also, one last thing... when doing 'Траливали' == $_GET['query'], this will only be true if your PHP script itself is encoded in UTF-8. Your text editor should be able to tell you the encoding of your file.

Andrew Moore
Yes, indeed. Thank you for such a good reply. But it is a common practice to use `fragment` for ajax addresses. And it is a source of a problem, not a solution.
topright
@topright: **It is the solution.** I'm not saying to drop the fragment altogether, I'm saying that your fragment **should always be last**. Rewrite your links to respect that. PHP does not handle the `query` after the `fragment` as it does not expect it to be there (it's illegal according to RFC3986). IE does not even bother to try encoding it as it is expecting a fragment (fragments are limited to ASCII characters only).
Andrew Moore
It's not. The problem occurs even without `query` in `fragment`.
topright
@topright: when doing `'Траливали' == $_GET['query']`, you need to make sure your PHP file is also encoded in UTF-8... Check that in your text editor.
Andrew Moore
"if your PHP script itself is encoded in UTF-8". You are right. My script is encoded as UTF-8 without BOM (using Notepad++).
topright
@topright `#ajax_call?query=Траливали` means that the fragment consists of the text `ajax_call?query=Траливали`. The fragment **is not sent to the server**. In other words, **anything you put after `#` in the URL is never sent to the server**.
deceze
@topright: `fragments` are great for ajax as they are stored in the history yet do not waste bandwidth by sending useless data to the server, which is why they are used in AJAX scenarios where they are parsed client-side. What you are trying to do will not work with fragments (they are never sent to PHP), which is why we tell you to use queries instead. You choose to ignore that advice.
Andrew Moore
The fragment is sent to the server via an Ajax call. The server receives Траливали that way.
topright
Anyway, do you understand my question?
topright
@topright No, your question just got confusing. Where did the heretofore unmentioned AJAX call come from and how does it send the fragment?
deceze
@topright: No, **they are never sent to the server.** Not when using AJAX, not when using a regular GET. Please read [RFC 3986 Section 3.5](http://tools.ietf.org/html/rfc3986#section-3.5) and [Wikipedia](http://en.wikipedia.org/wiki/Fragment_identifier#Processing). Fragments in JavaScript applications are processed client-side, not server-side.
Andrew Moore
Don't believe me? Try it out... `echo $_SERVER['REQUEST_URI'];` will give you exactly the request as seen by Apache. You'll quickly notice the fragment is missing. Also check your logs... There will be no fragment.
Andrew Moore
@deceze I think it would be better not to think of this as a fragment but as some bit of data being sent through an AJAX call. And yes, the whole question is an incredible mess.
Col. Shrapnel
@Col But it all depends on whether Траливали is part of the fragment, or if it's posted in the AJAX request body. The former won't work, the latter should.
deceze
@deceze I vote for the latter, as it will make a little sense of the question :)
Col. Shrapnel
@Col Your vote in the OP's ear... :o)
deceze
Let's reformulate this. Of course, the fragment is not sent to the server as it is. But the fragment contains part of a URL (path and query). JavaScript uses it to build the URL. Ajax sends this query (taken from the fragment) to the server. It is common practice and I'm surprised that some of you don't know it.
topright
"the whole question is incredible mess. – Col. Shrapnel" My question is (quote): "how to detect (in PHP) if query is encoded? How to decode it?" :)
topright
@topright: See, now the question is clear, and I'm willing to bet that the problem lies in your JavaScript Fragment-To-Query code.... Can you post that bit of code?
Andrew Moore
@Andrew Moore: The problem occurs with or without using Ajax.
topright
@topright: `$str = mb_convert_encoding($_GET['query'], 'UTF-8', 'Windows-1251');`. Firefox is encoding the query in cp1251 here, so the source encoding has to be named explicitly. `urldecode` is handled transparently by PHP.
Andrew Moore
+2  A: 

How the fragment is encoded is, unfortunately, browser-dependent:

Is fragment ID (hash) encoded by applying RFC-mandated URL escaping rules?
MSIE: NO
Firefox: PARTLY
Safari: YES
Opera: NO
Chrome: NO
Android: YES

As to the question of what encoding the browser uses to encode international (read: non-ASCII) characters before converting them to %nn escape sequences, "most browsers deal with this by sending UTF-8 data by default on any text entered in the URL bar by hand, and using page encoding on all followed links." (same source).
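
To make the encoding difference concrete, the following sketch percent-encodes the same word from both byte representations. It assumes the mbstring extension is available and that the script file is saved as UTF-8; the variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8, so the literal holds UTF-8 bytes.
$utf8 = 'Траливали';

// The same text re-encoded as Windows-1251 (cp1251), one byte per letter.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

// rawurlencode() works on raw bytes; it knows nothing about charsets.
$fromUtf8   = rawurlencode($utf8);
$fromCp1251 = rawurlencode($cp1251);

echo $fromUtf8, "\n";
echo $fromCp1251, "\n";
```

The cp1251 result is the %D2%F0... string from the question, which suggests the browser sent cp1251 bytes in that request rather than UTF-8.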

Artefacto
Nice comment, thank you.
topright
Not that it really matters how the fragment is encoded, as it is only processed client-side.
Andrew Moore
@And How is so? For javascript "á" != "%C3%A1"
Artefacto
A: 

The answer is easy: the string is always encoded, as stated in the HTTP standard.
What Firefox displays doesn't matter.

Also, as PHP decodes the query string automatically, no decoding is required either.

Note that '%D2%F0%E0%EB%E8%E2%E0%EB%E8' is a single-byte encoding, so your page is probably in 1251. At least, that is what the HTTP header says to the browser.
AJAX, on the other hand, always uses UTF-8.

So you just have to either use a single encoding (UTF-8) for your pages, or distinguish AJAX calls from regular ones.
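
One way to distinguish the two cases is to check whether the decoded bytes are valid UTF-8 and fall back to a single-byte conversion otherwise. This is only a sketch under assumptions: the helper name `to_utf8` is hypothetical, the fallback charset Windows-1251 is a guess based on the bytes in the question, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: normalise an already URL-decoded value to UTF-8.
function to_utf8(string $value, string $fallback = 'Windows-1251'): string
{
    // Valid UTF-8 (including plain ASCII) is returned unchanged.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise assume the legacy single-byte charset and convert.
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

// Simulate what PHP would place in $_GET for each kind of request.
$fromAddressBar = rawurldecode('%D2%F0%E0%EB%E8%E2%E0%EB%E8'); // cp1251 bytes
$fromAjax = rawurldecode('%D0%A2%D1%80%D0%B0%D0%BB%D0%B8%D0%B2%D0%B0%D0%BB%D0%B8'); // UTF-8 bytes

// Both inputs should normalise to the same UTF-8 string.
var_dump(to_utf8($fromAddressBar) === to_utf8($fromAjax));
```

The validity check is a heuristic: a few single-byte sequences happen to be valid UTF-8 as well, so using one encoding everywhere remains the more reliable option.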

As for the fragment - do not use a fragment value to send it to the server. Have a JS variable, and then use it twice - to set a fragment and to send to the server using JSON.

Col. Shrapnel
Page is in UTF-8.
topright
A: 

RFC 1738 states that only alphanumerics, the special characters "$-_.+!*'()," and the reserved characters ;/?:@=& are left unencoded within a URL. Everything else is encoded by the HTTP client, i.e. the Web browser. You can use rawurldecode() whether or not PHP automatically decodes the query string. Double-decoding is harmless as long as the decoded value contains no literal % sequences.
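
Whether a second decode is harmless depends on the data. The sketch below uses an illustrative value containing a literal percent sign and compares one decode with two.

```php
<?php
// The user typed the literal text "50%41"; the client encodes '%' as %25.
$onTheWire = rawurlencode('50%41');

$once  = rawurldecode($onTheWire); // what a single decode yields
$twice = rawurldecode($once);      // what decoding again yields

var_dump($onTheWire);
var_dump($once);
var_dump($twice);
```

For values without literal % sequences, such as the Cyrillic word in the question, the second decode changes nothing.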

stillstanding