Hi guys,

I'm working on a WebDAV implementation for PHP. To make it easier for Windows and other operating systems to work together, I need to jump through some character encoding hoops.

Windows uses ISO-8859-1 in its HTTP requests, while most other clients encode anything beyond ASCII as UTF-8.

My first approach was to ignore this altogether, but I quickly ran into issues when returning URLs. I then figured it's probably best to normalize all URLs.

Take ü as an example. OS X sends it over the wire as:

u%CC%88 (the letter "u" followed by the combining diaeresis, codepoint U+0308)

Windows sends this as:

%FC (latin1)

But when I run utf8_encode on %FC, I get:

%C3%BC (this is codepoint U+00FC)

Should I treat %C3%BC and u%CC%88 as the same thing? If so, how? Leaving it untouched seems to work OK for Windows: it somehow understands that it's a Unicode character, but updating the same file throws an error (for no apparent reason).

I'd be happy to provide more information.

+1  A: 

Mac stores Unicode characters in "decomposed" form, that is, "u" + ¨ (diaeresis) instead of "ü". Normalizer (from the intl extension) can take care of that. If you don't have Normalizer, try iconv('UTF8-MAC', 'UTF8', $str)
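
To illustrate the Normalizer route, here is a minimal sketch, assuming the intl extension is loaded and that the path arrives percent-encoded. The helper name normalizePathSegment is made up for this example, and whether NFC is the right target form for your server is also an assumption.

```php
<?php
// Hypothetical helper: decode a percent-encoded path segment and
// bring it to NFC (precomposed form) when the intl extension is available.
function normalizePathSegment(string $encoded): string
{
    $decoded = rawurldecode($encoded);

    // Normalizer ships with the intl extension; without it, leave the bytes alone.
    if (!class_exists('Normalizer')) {
        return $decoded;
    }

    $nfc = Normalizer::normalize($decoded, Normalizer::FORM_C);

    // normalize() returns false for input that is not valid UTF-8.
    return $nfc === false ? $decoded : $nfc;
}

// "u" + U+0308 as sent by OS X; with intl loaded this compares equal
// to the precomposed U+00FC form.
var_dump(rawurlencode(normalizePathSegment('u%CC%88')) === '%C3%BC');
```

With this in place, both u%CC%88 and %C3%BC end up as the same byte sequence, so they can be compared or stored as one path.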

stereofrog
I did not know about UTF8-MAC. I was looking for documentation on which encodings are available, but I couldn't find any. Any idea where I could have found UTF8-MAC?
Evert
On my system (OS X 10.6), "iconv --list" shows 'UTF8-MAC' among others, but the above code doesn't work. Strange.
stereofrog
+1  A: 

I hate answering my own questions, but here goes.

I ended up not bothering. I did extensive research on how various operating systems encode paths and handle encodings. It turns out that in most cases other OSes handle paths in other normalization forms just fine. Windows handled it rather badly, but it works.

Whenever I receive a path that isn't valid UTF-8 at all, I try to detect the encoding and convert it to UTF-8.
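
Roughly, the detect-and-convert step can be sketched like this. It assumes the mbstring extension is available and, as a simplification, that anything that isn't valid UTF-8 is ISO-8859-1 (what Windows sends). The helper name toUtf8Path is only for illustration.

```php
<?php
// Hypothetical helper: return the path as UTF-8.
function toUtf8Path(string $path): string
{
    // Valid UTF-8 (which includes plain ASCII) passes through untouched.
    if (mb_check_encoding($path, 'UTF-8')) {
        return $path;
    }

    // Assumption: the only legacy encoding expected here is ISO-8859-1.
    return mb_convert_encoding($path, 'UTF-8', 'ISO-8859-1');
}

$fromWindows = rawurldecode('%FC');                  // single byte 0xFC, not valid UTF-8
echo rawurlencode(toUtf8Path($fromWindows)), "\n";   // %C3%BC
```

Checking for valid UTF-8 first matters: running a Latin-1 conversion on a path that is already UTF-8 would double-encode it.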

Evert