Delphi, charset detection ([Uni]SynEdit) - Utf8Decode problem

+1 Q:

I'm using Unicode SynEdit, which (in theory) has basic file/stream encoding detection. It worked fine until I tried opening a file generated by my PHP script. UniSynEdit detects that file as UTF-8 with no BOM. Unfortunately, it doesn't open: the loaded string is empty. I debugged it, and the problem seems to be the function Utf8Decode, which fails for some reason and returns an empty string. I've also checked the file with a hex editor, and the detection looks right: it has no BOM, all the normal characters are encoded using one byte, while some Polish letters I had in the file (like "ł") are double-byte...

What could be wrong, and how can I prevent this? I believe loading the file with the wrong encoding is better than loading no file at all...

+2 A:

If you really want to load files that are not correctly UTF-8 encoded, then you need a function that does not return an empty result for a string containing invalid byte sequences, but instead replaces each of them with a replacement character. See the Wikipedia entry on UTF-8, in particular the section on "Invalid byte sequences".

Unfortunately, the Delphi 2009 UTF8Decode() (I don't have Delphi 7 to check there) calls MultiByteToWideChar(CP_UTF8, ...) internally, which returns an error on invalid byte sequences.

What you'd have to do is use an alternative decoding function. Maybe one of the third-party Delphi libraries has its own. Maybe you could use one of the linked libraries here. If all else fails, you could write your own, maybe based on this code from the Unicode Consortium.
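To illustrate the "write your own" route, here is a minimal sketch of a decoder that substitutes U+FFFD for each malformed sequence instead of giving up. The name LenientUtf8Decode is hypothetical (it is not part of the RTL, SynEdit, or any library mentioned here), and the sketch assumes Delphi 2009 or later for RawByteString; on older versions AnsiString would take its place. It is not the Unicode Consortium code, just one plain way of doing it.

```delphi
// Hypothetical helper: decodes UTF-8 bytes to UTF-16, emitting U+FFFD
// for every malformed sequence rather than returning an empty string.
function LenientUtf8Decode(const S: RawByteString): WideString;
const
  ReplacementChar = WideChar($FFFD);
var
  I, Len, Need, K, OutLen: Integer;
  B: Byte;
  CP, MinCP: Cardinal;
  Ok: Boolean;
begin
  Len := Length(S);
  SetLength(Result, Len); // output never needs more UTF-16 units than input bytes
  OutLen := 0;
  I := 1;
  while I <= Len do
  begin
    B := Byte(S[I]);
    Inc(I);
    Ok := True;
    MinCP := 0;
    // Classify the lead byte: payload bits, continuation count,
    // and the smallest code point this length may legally encode.
    if B < $80 then
      begin CP := B; Need := 0; end
    else if (B and $E0) = $C0 then
      begin CP := B and $1F; Need := 1; MinCP := $80; end
    else if (B and $F0) = $E0 then
      begin CP := B and $0F; Need := 2; MinCP := $800; end
    else if (B and $F8) = $F0 then
      begin CP := B and $07; Need := 3; MinCP := $10000; end
    else
      begin CP := 0; Need := 0; Ok := False; end; // stray continuation or invalid lead

    // Consume continuation bytes (10xxxxxx); stop at the first bad one
    // without consuming it, so it is re-examined as a lead byte.
    K := 0;
    while Ok and (K < Need) do
      if (I <= Len) and ((Byte(S[I]) and $C0) = $80) then
      begin
        CP := (CP shl 6) or (Byte(S[I]) and $3F);
        Inc(I);
        Inc(K);
      end
      else
        Ok := False;

    // Reject overlong forms, surrogate code points and values above U+10FFFF.
    if Ok and ((CP < MinCP) or (CP > $10FFFF) or
       ((CP >= $D800) and (CP <= $DFFF))) then
      Ok := False;

    if not Ok then
    begin
      Inc(OutLen); Result[OutLen] := ReplacementChar;
    end
    else if CP < $10000 then
    begin
      Inc(OutLen); Result[OutLen] := WideChar(CP);
    end
    else
    begin
      // Encode as a UTF-16 surrogate pair.
      Dec(CP, $10000);
      Inc(OutLen); Result[OutLen] := WideChar($D800 or (CP shr 10));
      Inc(OutLen); Result[OutLen] := WideChar($DC00 or (CP and $3FF));
    end;
  end;
  SetLength(Result, OutLen);
end;
```

With this approach, the bytes $61 $C5 $82 $FF would decode to "ał" followed by one U+FFFD, so a file with a few bad bytes still opens and the damage is visible in the editor.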

mghie 2009-09-25 20:20:44

BTW: If you didn't even *mean* your PHP script to create a UTF-8 file - think again about that. It should, and preferably valid UTF-8 :-)

mghie 2009-09-25 20:24:43

That's not the point :) Actually, thanks to that I found my app failing at some point. So far I was sure it handles all the **valid** files, and it does. But I had no chance to test it against invalid ones ;)

migajek 2009-09-25 23:30:38

Thanks, it seems that using UTF8StringToWideString from cUnicodeCodecs (Delphi Fundamentals) works fine :]

migajek 2009-09-26 09:13:17
