How do I determine the character set of a string?

+4 Q:

I have several files in several different languages. I thought they were all encoded as UTF-8, but now I'm not so sure: some characters look fine, some do not. Is there a way to break out the strings and try to identify the character sets? Perhaps split on whitespace and then identify each word? Finally, is there an easy way to convert characters from another character set to UTF-8?

+2 A:

Take a look at iconv (the libiconv library, or the Text::Iconv module for Perl):

http://www.gnu.org/software/libiconv/

http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm

rebra 2008-11-25 22:27:46

+4 A:

If you don't know the character set for sure, you can basically only guess. utf8::valid might help you with that, but you can't really know for sure. If you know that anything that isn't UTF-8 must be in one specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set, unless otherwise specified. You will lose your sanity if you don't.
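A minimal sketch of that guessing strategy, written in Python purely for illustration (it does not use Perl's utf8::valid). It assumes the whole input shares one encoding, tries strict UTF-8 first, and otherwise falls back to a single known charset. The function name `guess_decode` and the Latin-1 default are assumptions made for this example.

```python
from typing import Tuple


def guess_decode(raw: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for bytes of uncertain encoding."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this step cannot fail.
        # It is still only a guess that the fallback charset is the right one.
        return raw.decode(fallback), fallback
```

Passing the UTF-8 check is not proof either: pure ASCII text is valid UTF-8, and some byte sequences from other charsets happen to be valid UTF-8 as well.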

As for your question about how to convert between character sets: the Encode module is there to do that for you.
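The conversion itself is a two-step round trip: decode the bytes from the known source charset into text, then encode that text as UTF-8. The sketch below shows the idea in Python for illustration only. The helper name `to_utf8` is hypothetical, and the source encoding has to be supplied by the caller.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source charset to UTF-8."""
    # Step 1: interpret the bytes using the charset they were written in.
    text = raw.decode(source_encoding)
    # Step 2: serialize the resulting text as UTF-8.
    return text.encode("utf-8")
```

In Perl, Encode's `decode` and `encode` functions play the same two roles.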

Leon Timmermans 2008-11-25 22:37:34

+3 A:

erickson 2008-11-25 22:39:12
