ansaurus

Question

How to read unicode (utf-8) / binary file line by line

Answer 1

+1 A:

fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:

infile = fopen(inname, "r, ccs=UTF-8");

Hans Passant 2010-01-21 22:27:22

nobugz, I can only use GCC/C99 in this project. Is there any way to use your solution with GCC? :) Thank you, nobugz.

Freeseif 2010-01-22 21:05:56

I doubt it, I don't know the gcc CRT well enough to know. Try it.

Hans Passant 2010-01-22 21:50:16
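
For GCC, one option that needs no compiler-specific fopen() flag is to read the file as raw bytes. UTF-8 never uses the byte 0x0A inside a multi-byte sequence, so plain fgets() still splits at real line ends. The following is a minimal sketch under that assumption; count_utf8_lines is a hypothetical helper name, and the 1024-byte buffer size is arbitrary.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads a UTF-8 file line by line as raw bytes.
   Returns the number of lines, or -1 if the file cannot be opened. */
long count_utf8_lines(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char buf[1024];
    long lines = 0;
    int at_line_start = 1;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);

        /* A line longer than the buffer arrives in several chunks;
           count it only once, on its first chunk. */
        if (at_line_start)
            lines++;

        /* buf holds undecoded UTF-8 bytes here; len is a byte count,
           not a character count. */
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }

    fclose(fp);
    return lines;
}
```

Calling count_utf8_lines("test_utf8.txt") returns the number of lines in that file. Inside the loop, buf can be passed to any routine that works on UTF-8 bytes.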

Answer 2

+2 A:

This article provides an encoding and a decoding routine and explains how Unicode is encoded as UTF-8:

http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/

It can easily be adapted to C. Simply encode your ANSI string or decode the UTF-8 string, then do a byte comparison.

EDIT: After the OP said that rewriting the function from C++ is too hard, here is a template.

What is still needed:
+ Free the allocated memory (or wait until the process ends, or ignore it)
+ Add the 4-byte functions
+ Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but C is really stupid!) and finally
+ Find some other errors

#include <stdlib.h>
#include <string.h>

#define         MASKBITS                0x3F
#define         MASKBYTE                0x80
#define         MASK2BYTES              0xC0
#define         MASK3BYTES              0xE0
#define         MASK4BYTES              0xF0
#define         MASK5BYTES              0xF8
#define         MASK6BYTES              0xFC

char* UTF8Encode2BytesUnicode(unsigned short* input)
{
   int size = 0,
       cindex = 0;
   while (input[size] != 0)
     size++;
   // Reserve enough space: each 16-bit unit needs at most 3 bytes, plus the terminator
   char* result = (char*) malloc(size * 3 + 1);
   if (result == NULL)
      return NULL;
   for (int i=0; i<size; i++)
   {
      // 0xxxxxxx
      if(input[i] < 0x80)
      {
         result[cindex++] = ((char) input[i]);
      }
      // 110xxxxx 10xxxxxx
      else if(input[i] < 0x800)
      {
         result[cindex++] = ((char)(MASK2BYTES | (input[i] >> 6)));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
      // 1110xxxx 10xxxxxx 10xxxxxx (everything else that fits in 16 bits;
      // surrogate pairs are not combined here)
      else
      {
         result[cindex++] = ((char)(MASK3BYTES | (input[i] >> 12)));
         result[cindex++] = ((char)(MASKBYTE | ((input[i] >> 6) & MASKBITS)));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
   }
   // Terminate the string and hand it to the caller, who must free() it
   result[cindex] = '\0';
   return result;
}

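wchar_t* UTF8Decode2BytesUnicode(char* input)
{
  int size = strlen(input);
  // At most one wchar_t per input byte, plus the terminator
  wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
  if (result == NULL)
      return NULL;
  int rindex = 0,
      windex = 0;
  while (rindex < size)
  {
      // Work on an unsigned byte so values >= 0x80 are not sign-extended
      unsigned char c = (unsigned char) input[rindex];
      wchar_t ch;

      // 1110xxxx 10xxxxxx 10xxxxxx
      if((c & MASK4BYTES) == MASK3BYTES && rindex + 2 < size)
      {
         ch = ((c & 0x0F) << 12) | (
               (input[rindex+1] & MASKBITS) << 6)
              | (input[rindex+2] & MASKBITS);
         rindex += 3;
      }
      // 110xxxxx 10xxxxxx
      else if((c & MASK3BYTES) == MASK2BYTES && rindex + 1 < size)
      {
         ch = ((c & 0x1F) << 6) | (input[rindex+1] & MASKBITS);
         rindex += 2;
      }
      // 0xxxxxxx, or a byte this 2-byte decoder cannot handle: copy it through
      else
      {
         ch = c;
         rindex += 1;
      }

      result[windex++] = ch;
   }
   // Terminate the string and hand it to the caller, who must free() it
   result[windex] = 0;
   return result;
}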

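// Still to be written (see the list above); declared here so the call below compiles.
char* UTF8Encode4BytesUnicode(unsigned int* input);

char* getUnicodeToUTF8(wchar_t* myString) {
  // Pick the encoder that matches the platform's wchar_t width
  int size = sizeof(wchar_t);
  if (size == 1)
    return (char*) myString;
  else if (size == 2)
    return UTF8Encode2BytesUnicode((unsigned short*) myString);
  else
    return UTF8Encode4BytesUnicode((unsigned int*) myString);
}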

Thorsten S. 2010-01-21 22:32:02
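
The template leaves the 4-byte case open. As a minimal sketch of how the bit layout extends to it, the function below encodes a single code point into 1 to 4 bytes. It is not the missing UTF8Encode4BytesUnicode itself; utf8_encode_codepoint is a hypothetical name, and the choice to reject surrogates and values above U+10FFFF by returning 0 is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of cp into out and
   returns the number of bytes written (1..4), or 0 if cp is not a
   valid Unicode scalar value. */
size_t utf8_encode_codepoint(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Surrogate halves are not characters on their own */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A 4-byte string encoder could call this once per input element and append the returned bytes to its output buffer.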

Thorsten S., adapting this long C++ function takes an advanced C/C++ programmer xD Thank you, Thorsten S.

Freeseif 2010-01-22 21:20:17

Answer 3

+1 A:

I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.

EDIT:

Byte Order Marks are a few bytes at the beginning of the file which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).

Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
http://stackoverflow.com/questions/1772321/what-is-xml-bom-and-how-do-i-detect-it

elcuco 2010-01-21 22:57:18

Alas, as UTF-*8* is a *byte* format, there is no need for a byte order mark. I'm so sorry that I spoiled your badness....

Thorsten S. 2010-01-21 23:10:11

elcuco, can you please explain a bit more? :) Thank you, elcuco.

Freeseif 2010-01-22 21:22:14

@Freeseif - read updated "answer"

elcuco 2010-01-22 22:00:51

@Thorsten S., just because there's no need for a byte order mark doesn't mean you won't get one. I just ran across one today, probably produced by Notepad. Wikipedia admits that it's possible to use one to mark a file as UTF-8, even though it's not recommended.

Mark Ransom 2010-01-22 22:20:23

Yep, there are UTF-8 files which use the BOM as a marker, but many UTF-8 texts don't use it. We had a problem with UTF-16 encoding, compared it to the UTF-8 output of a company library, and noticed the missing BOM. So we looked up the references and voilà: you may use it, but you don't need to. As the programmer of the lib explained, a BOM causes many problems because UTF-8 is rampant in internet apps (mail, Usenet etc.) and you don't want strange characters appearing in the text. I object to "Most examples here will fail".

Thorsten S. 2010-01-22 23:34:22

@Thorsten S. Most examples here do not even take into consideration this abomination called the BOM. Many editors "implant" those BOMs and save them into files, some tools (PHP in my example) 'ignore' them, and this messes you up. I agree that it does not really answer the question, but it is still related, and it might prevent people from properly reading some UTF-8 encoded files.

elcuco 2010-01-23 11:27:01

Does that mean we need to ignore the first line? In other words, is there any good (simple) solution in C++?!

Freeseif 2010-01-23 12:41:35

No, you need to ignore the first three bytes of the very first line. But only if they're the bytes 0xEF 0xBB 0xBF, in that order; if they're not, you don't have a BOM and you use the full line. (If the files are saved with Notepad, you'll always get a BOM in UTF-8. Other editors vary.)

Michael Madsen 2010-01-25 20:39:59
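
The rule in the comment above (drop the first three bytes only if they are 0xEF 0xBB 0xBF) can be sketched as follows. This is a minimal illustration, not code from the thread: skip_utf8_bom is a hypothetical helper name, and it assumes the stream was opened in binary mode and is still positioned at the start of the file.

```c
#include <stdio.h>

/* Hypothetical helper: assumes fp is positioned at the start of a file
   opened in binary mode. Returns 1 if a UTF-8 BOM was found and
   skipped, 0 if there was none (the stream is then back at offset 0). */
int skip_utf8_bom(FILE *fp)
{
    unsigned char b[3];
    size_t n = fread(b, 1, sizeof b, fp);

    if (n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return 1;   /* BOM consumed; reading continues after it */

    /* No BOM (or a file shorter than 3 bytes): start over so the
       first line is read in full. */
    fseek(fp, 0L, SEEK_SET);
    return 0;
}
```

Calling it once right after fopen() and before the first fgets() lets the same reading loop handle files with and without a BOM.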

Answer 4

+4 A:

A nice property of UTF-8 is that you do not need to decode in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.

robinr 2010-01-22 22:21:14
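
A minimal sketch of that property: sorting UTF-8 strings with strcmp, without decoding them, yields code point order, because strcmp compares bytes as unsigned char and UTF-8 lead bytes grow with the code point. The names compare_utf8 and sort_utf8_strings are hypothetical. Note that code point order is not the same as a language-aware alphabetical order.

```c
#include <stdlib.h>
#include <string.h>

/* qsort callback: compares two UTF-8 strings byte by byte. */
static int compare_utf8(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sorts an array of UTF-8 strings in place
   into code point order without decoding them. */
void sort_utf8_strings(const char **strings, size_t count)
{
    qsort(strings, count, sizeof *strings, compare_utf8);
}
```

For example, the strings "€" (U+20AC), "z" (U+007A) and "é" (U+00E9) come out as "z", "é", "€", which matches the order of their code points.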

Answer 5

A:

Just to settle the BOM argument, here is a file from Notepad:

 [paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
 0000000 ef bb bf 61 0d 0a 62 0d 0a 63
 0000012

with a BOM at the start.

Personally I don't think there should be a BOM (since it's a byte format), but that's not the point.

pm100 2010-01-23 01:01:00

Answer 6

A:

I found a solution to my problem, and I want to share it with anyone interested in reading a UTF-8 file in C99.

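#include <stdio.h>
#include <string.h>

void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
            size_t len = strlen((char *)iobuf);
            // strip the trailing newline, but keep a lone "\n" so empty lines can be detected below
            if(len > 1 &&  iobuf[len-1] == '\n')
                iobuf[len-1] = 0;
            len = strlen((char *)iobuf);
            // len is a byte count, not a character count; %zu is the C99 format for size_t
            printf("(%zu) \"%s\"  ", len, iobuf);
            if( iobuf[0] == '\n' )
                printf("Yes\n");
            else
                printf("No\n");
    }
}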

void ReadUTF16BE(FILE* fp)
{
}

void ReadUTF16LE(FILE* fp)
{
}

int main()
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp == NULL)
        return 1;

    // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
    // encoding
    unsigned char b[3] = {0};
    fread(b,1,2, fp);
    if( b[0] == 0xEF && b[1] == 0xBB)
    {
        fread(b,1,1,fp); // 0xBF
        ReadUTF8(fp);
    }
    else if( b[0] == 0xFE && b[1] == 0xFF)
    {
        ReadUTF16BE(fp);
    }
    else if( b[0] == 0xFF && b[1] == 0xFE)
    {
        // the UTF-16 little-endian BOM is the byte pair FF FE
        ReadUTF16LE(fp);
    }
    else
    {
        // we don't know what kind of file it is, so assume it's standard
        // ASCII with no BOM encoding
        rewind(fp);
        ReadUTF8(fp);
    }

    // fp is known to be non-NULL here, so closing it is safe
    fclose(fp);
    return 0;
}

Freeseif 2010-01-25 20:33:46
