Hi, I've been working on a web application that should be able to accept tags and search queries in multiple languages. That's not asking too much, is it?

Now, on my development MAMP server everything is great: I add multilingual tags, search in any language I want, and so on. On the production WAMP server, however, multilingual characters cause trouble. And not all the time, just some of the time, or for some of the characters; I'm not sure yet. What happens is that they pick up extra characters and then their URL decoding comes out wrong.

Both environments use PHP 5, MySQL and Apache.

My guess is that I got a setting wrong somewhere.

Any ideas?

Thanks, Omer.

  • update: I'm now sure it's particular letters (the Hebrew ל, מ and א, for example)

  • update: easily reproducible: the same letters always get the wrong encoding.

  • content type is "text/html; charset=utf-8"

Also, I've pinpointed it a bit further:
I use the search string: ליבני
On the results page I see this:

  • In the address bar the search phrase is correct, properly url-encoded.
  • In the HTML itself I see the string "�_יבני", which is "%D7_%D7%99%D7%91%D7%A0%D7%99" when URL-encoded. This means the ל has been encoded as "%D7_" instead of "%D7%9C", as it should have been.
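
For reference, here is a minimal Python sketch (not PHP, and not the application's code) that reproduces the symptom at the byte level. It assumes, based only on the strings quoted above, that the second UTF-8 byte of ל (0x9C) was replaced by an underscore somewhere on the server; the variable names are made up for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

# The search phrase from the question.
search = "ליבני"

# Correct UTF-8 percent-encoding, as seen in the address bar.
encoded = quote(search, encoding="utf-8")
# encoded == "%D7%9C%D7%99%D7%91%D7%A0%D7%99"

# Assumption: something on the server replaces the byte 0x9C with "_".
raw = unquote_to_bytes(encoded)
damaged = raw.replace(b"\x9c", b"_")

# Percent-encoding the damaged bytes gives the string seen in the HTML source.
damaged_encoded = quote(damaged)

# Decoding the damaged bytes as UTF-8 yields a replacement character,
# because the lead byte 0xD7 is no longer followed by a continuation byte.
shown = damaged.decode("utf-8", errors="replace")

print(encoded)
print(damaged_encoded)
print(shown)
```

If this assumption holds, only letters whose UTF-8 encoding contains the affected byte would break, which would fit the observation that it is always the same letters.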

I don't really know where to go further.
Any ideas? anyone?

A: 

I recommend using UTF-8 for both internal and external encoding. Use the AddDefaultCharset directive to tell Apache your default encoding:

AddDefaultCharset utf-8

Now you just need to ensure that your application handles the data correctly (see PHP's default_charset directive). If you use UTF-8 for your output, the client should use it for further requests (URLs, form data) as well.

Gumbo
Thanks, but it didn't solve my problem.
Omer
+1  A: 

Charsets are really a simple concept. The confusing thing about them is that there are multiple levels where the encoding must be handled correctly. If you mess up in one place, it will usually show up in a completely different place.
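
To illustrate the point about levels, here is a small Python sketch (an illustration only, unrelated to the asker's PHP stack) in which one layer writes UTF-8 and another layer wrongly reads the same bytes as Latin-1. The names `stored`, `misread` and `recovered` are hypothetical.

```python
# One layer (say, the application) encodes the text as UTF-8.
text = "ליבני"
stored = text.encode("utf-8")

# Another layer (say, a connection or template) wrongly assumes Latin-1.
# No error is raised here; the damage only becomes visible on output.
misread = stored.decode("latin-1")

# Each Hebrew letter takes two bytes in UTF-8, so the misread string
# has twice as many characters as the original.
print(len(text), len(misread))

# As long as no bytes were lost, reversing the wrong step recovers the text.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)
```

The mistake happens at the decoding step but is only noticed wherever the text is finally displayed, which is why such bugs are hard to trace.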

So the slightly condescending, but also very true answer to your problem is that you need to know what you're doing, instead of just poking at it with a stick until it kind of looks okay.

I recommend the following reading:

troelskn
A: 

It turns out the problem is somewhere within PHP's parse_url(). I guess that in some versions, on some platforms, parse_url() doesn't handle UTF-8 characters correctly. It has been spotted on Windows at least once before.

I was able to work around it for now.

Thanks for everybody's time and attention, Omer.

Omer