A: 

Can you open the file in a hex editor and verify, with a simple input example, whether the written bytes really are not the Unicode values you passed to write()? A text editor often has no way to determine the character set, and yours may simply have assumed ISO-8859-1.
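For reference, a minimal check could look like the sketch below (the file name check.txt is just an example). write() copies bytes verbatim, so whatever encoding the string literal has in the compiled program is exactly what lands on disk; `hexdump -C check.txt` should then show 0xFC for the 'ü' if the source was saved as ISO-8859-1, or 0xC3 0xBC if it was saved as UTF-8.

/* Sketch: write a string containing an umlaut and inspect the file
 * with `hexdump -C check.txt`.  ISO-8859-1 source: 'ü' is the single
 * byte 0xFC.  UTF-8 source: 'ü' is the two bytes 0xC3 0xBC. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *s = "für\n";   /* encoding depends on how this source file is saved */
    int fd = open("check.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, s, strlen(s));   /* write() does no conversion at all */
    close(fd);
    return 0;
}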

Once you have done this, could you edit your original post to add the pertinent information?

Oliver N.
Yes, I did, with hexdump -C.
prinzdezibel
+2  A: 

Changing the locale won't change the actual data written to the file with write(). You have to actually produce UTF-8 bytes in order to write them to a file. For that purpose you can use a library such as ICU.

Edit after your edit of the question: UTF-8 only differs from ISO-8859-1 in the "special" symbols (ümlauts, áccénts, etc.). So, for any text that doesn't contain such symbols, the two encodings are identical. However, if your program's strings do contain those symbols, you have to make sure your text editor treats the source as UTF-8. Sometimes you just have to tell it to.

To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.

Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:

iconv -f latin1 -t utf8 file.c > file.utf8.c

This writes a converted copy (file.utf8.c here, which you can then rename over the original) in which all your Latin-1 strings are UTF-8, so when you print them they will definitely be UTF-8. Note that every byte value is valid Latin-1, so iconv itself won't complain; if the output shows strange, doubled-up characters, your strings were most likely in UTF-8 already.
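If you would rather not touch the source files, glibc also ships iconv(3), so the same conversion can be done at run time without pulling in ICU. A minimal sketch (error handling kept short, buffer size chosen arbitrarily for the example):

/* Sketch: convert an ISO-8859-1 string to UTF-8 at run time with
 * iconv(3), which is part of glibc, so no extra library is needed. */
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "f\xFCr";               /* "für" with a Latin-1 0xFC byte */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    /* out now holds UTF-8: the ü is the byte pair 0xC3 0xBC */
    fwrite(out, 1, sizeof(out) - outleft, stdout);
    putchar('\n');
    return 0;
}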

Regards,

Diego Sevilla
How can I do this with libc, without using another library?
prinzdezibel
Well, yes, of course. As I said, just use an editor that supports UTF-8.
Diego Sevilla
diegosevilla: I have umlauts in my source code and they are definitely written ISO-8859 encoded. How can I force the program to write them out UTF-8 encoded without the ICU library?
prinzdezibel
I recommend converting the source code instead. That saves you from having to use another library, and it speeds up the program (no runtime conversions). Use iconv as shown in the answer.
Diego Sevilla
+1  A: 

Yes, you can do it with glibc. It calls this multibyte support rather than UTF-8, because it can handle more than one encoding. Check out this part of the manual.

Look for the functions that start with the mb prefix, and also those with the wc prefix, for converting between multibyte and wide characters. You'll have to set the locale to a UTF-8 one first with setlocale() so that the multibyte routines actually use UTF-8.

If you are starting from wide-character (Unicode) data, I believe the function you are looking for is wcstombs().
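Putting that together, a minimal sketch might look like this. The locale name de_DE.UTF-8 is only an example and has to be installed on the system; any UTF-8 locale (or "" if your environment is already UTF-8) should work:

/* Sketch: emit UTF-8 through the glibc multibyte machinery.  The wide
 * string is encoding-independent in the source; wcstombs() converts it
 * to the encoding of the current locale, selected by setlocale(). */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    wchar_t wide[] = L"f\u00FCr";        /* "für" as wide characters */
    char buf[64];

    if (setlocale(LC_ALL, "de_DE.UTF-8") == NULL)
        return 1;                        /* locale not installed */

    size_t n = wcstombs(buf, wide, sizeof(buf));
    if (n == (size_t)-1)
        return 1;                        /* character not representable */

    fwrite(buf, 1, n, stdout);           /* bytes are UTF-8: 0xC3 0xBC for the ü */
    putchar('\n');
    return 0;
}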

Augusto Radtke