ansaurus

Question

Answer 1

A:

I'm not certain that this is the issue, but I've had a very similar experience where the file was encoded as UTF-8 while the XML header claimed it was UTF-16.

Because of the mismatch, parsing failed with the same error you got. Changing the XML header from UTF-16 to UTF-8 fixed it for me.

You may be experiencing a similar issue.

Bryan McLemore 2010-01-14 18:31:48
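One way to check for the kind of mismatch described above is to look at the first few bytes of the document and compare the result with what the XML header claims. The following is only a minimal sketch in plain C (which also compiles as Objective-C); `encoding_guess` and `guess_encoding` are hypothetical names, and the sketch deliberately ignores UTF-32 and other encodings.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_BE, ENC_UTF16_LE } encoding_guess;

/*
 * Guesses the encoding family of an XML document from its first bytes.
 * Hypothetical helper; it checks for a BOM first, then for how the
 * leading "<?" of the XML declaration is laid out in memory.
 */
static encoding_guess guess_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* A BOM, if present, settles the question. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;

    /* No BOM: in UTF-16 every ASCII character is padded with a zero byte. */
    if (len >= 4 && buf[0] == 0x00 && buf[1] == '<' && buf[2] == 0x00 && buf[3] == '?')
        return ENC_UTF16_BE;
    if (len >= 4 && buf[0] == '<' && buf[1] == 0x00 && buf[2] == '?' && buf[3] == 0x00)
        return ENC_UTF16_LE;

    /* "<?" as two single bytes: UTF-8 or another ASCII-compatible encoding. */
    if (len >= 2 && buf[0] == '<' && buf[1] == '?')
        return ENC_UTF8;

    return ENC_UNKNOWN;
}
```

If the guess is `ENC_UTF8` but the header says `encoding="UTF-16"`, the file is probably in the state described in this answer.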

The header says: `<?xml version="1.0" encoding="utf-8"?>`. If I save this XML in a file and open it with BBEdit, I see that it has UTF-8 encoding with no BOM. What I see in Resourcerer when I open this file, however, is the BOM sequence after the closing '>' of the header. My question was: how do I get rid of this? And I see in BBEdit, that

Nava Carmon 2010-01-14 20:02:27

Answer 2

A:

Well, maybe this is not the best approach to getting rid of BOM bytes, but it works. This is for those who, like me, spent hours trying to make NSXMLParser swallow BOMs. It assumes that you get your data through NSURLConnection and store it in `NSMutableData *webData`.

    const char bom[3] = {0xEF, 0xBB, 0xBF};

    char *data = [webData mutableBytes];
    char *cp = data, *pp;
    long lessBom = 0;
    do {
        cp = strstr((const char *)cp, (const char *)bom);
        if (cp) {
            pp = cp;
            cp += 3;
            memcpy(pp, cp, strlen(cp));
            lessBom += 3;
        }
    } while (cp != NULL);

    NSMutableData *newData = [[NSMutableData alloc] initWithBytes:data length:webData.length - lessBom];

Then you create your parser with newData and it JUST WORKS! I'll be glad to get any comments on or improvements to this code.

Nava Carmon 2010-01-15 07:36:02

Definitely do not use `strstr` here. That's for C strings, which are null-terminated (the last byte is 0). The contents of an NSMutableData are not null-terminated unless you do this yourself, and can contain null bytes, the first of which `strstr` and other C-string functions will treat as the terminator. NSData and NSMutableData have methods that can do the same job much more safely; see their documentation for details.

Peter Hosey 2010-01-16 07:54:56
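To illustrate the point about null termination, here is a rough sketch of a length-aware alternative in plain C (valid Objective-C as well). It is not the NSData-based approach the comment alludes to, just a swapped-in technique: `memcmp` against an explicit length instead of `strstr`. The name `strip_utf8_boms` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF. */
static const unsigned char UTF8_BOM[3] = {0xEF, 0xBB, 0xBF};

/*
 * Removes every UTF-8 BOM sequence from buf in place and returns the new
 * length. The length is passed explicitly, so zero bytes inside the data
 * are treated like any other byte. Hypothetical helper name.
 */
static size_t strip_utf8_boms(unsigned char *buf, size_t len)
{
    size_t src = 0, dst = 0;

    assert(buf != NULL || len == 0);

    while (src < len) {
        if (len - src >= sizeof UTF8_BOM &&
            memcmp(buf + src, UTF8_BOM, sizeof UTF8_BOM) == 0) {
            src += sizeof UTF8_BOM;   /* skip the BOM */
        } else {
            buf[dst++] = buf[src++];  /* keep this byte */
        }
    }
    return dst;
}
```

With an NSMutableData, one would presumably call this on `[webData mutableBytes]` with `[webData length]` and then shrink the object with `setLength:` to the returned value.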

Thanks, I thought about that, though I can add a '\0' at the end. Will refactor it anyway.

Nava Carmon 2010-01-16 16:28:56

Answer 3

A:

I tried to create a string with those BOM bytes like that:

    const UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason.

Make sure you gave the correct encoding when instantiating `theResult`, the string that `noBOMString` is derived from. If the document data was UTF-8, make sure you instantiated the string as UTF-8. Likewise, if the data was UTF-16, make sure you instantiated the string as UTF-16.

If you pass the wrong encoding, either the string won't instantiate at all (I'm assuming that isn't your problem) or some characters will be wrong. The BOM would be one of these: If the input is UTF-8 and you interpret it as MacRoman or ISOLatin1, it'll appear in the string as three separate characters. These three separate characters won't compare equal to the single character that is the BOM.

Peter Hosey 2010-01-16 08:14:57
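The "three separate characters" effect can be shown with a small sketch in plain C. It decodes the same three bytes once as ISO Latin-1 and once as UTF-8. Both decoders are hypothetical, simplified helpers: the UTF-8 one handles only 1- to 3-byte sequences and does no real validation, and MacRoman is left out because its byte-to-character mapping is different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latin-1 maps each byte to the code point with the same value. */
static size_t decode_latin1(const unsigned char *in, size_t len, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = in[i];
    return len;
}

/*
 * Simplified UTF-8 decoder (1- to 3-byte sequences only, no validation
 * of continuation bytes). Returns the number of code points written.
 */
static size_t decode_utf8_bmp(const unsigned char *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < len) {
            out[n++] = ((uint32_t)(b & 0x1F) << 6) | (uint32_t)(in[i + 1] & 0x3F);
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < len) {
            out[n++] = ((uint32_t)(b & 0x0F) << 12)
                     | ((uint32_t)(in[i + 1] & 0x3F) << 6)
                     | (uint32_t)(in[i + 2] & 0x3F);
            i += 3;
        } else {
            out[n++] = 0xFFFD;  /* replacement character for anything else */
            i += 1;
        }
    }
    return n;
}
```

Fed the bytes 0xEF, 0xBB, 0xBF, the Latin-1 decoder yields three code points (U+00EF, U+00BB, U+00BF), while the UTF-8 decoder yields the single code point U+FEFF, so a search for one will not match the other.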

Yes, the `theResult` string was instantiated with NSUTF8StringEncoding, so I suppose checking for the BOM as a 3-char string was right. The fact is that the C code works. AFAIK there are different BOM sequences for different encodings. So how would you suggest checking for it? Is it possible using Cocoa strings?

Nava Carmon 2010-01-16 16:26:32

There is only one BOM: U+FEFF. It appears as different sequences of bytes in different encodings because different encodings encode the same characters as different bytes. Creating `bomString` from UTF-8 is one way, but it doesn't matter which UTF you create it from, because (as long as you give the right code units) it will always be U+FEFF in the resulting string. Your code should work just fine; you might try dumping `theResult` to a file before and after `stringByReplacing…` and viewing it with a hex editor such as Hex Fiend.

Peter Hosey 2010-01-16 16:48:16
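As a rough illustration of "one BOM, different bytes", the following plain C sketch encodes the single code point U+FEFF in UTF-8 and in both byte orders of UTF-16. The function names are hypothetical, and both encoders are limited to code points up to U+FFFF (surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes a code point up to U+FFFF as UTF-8; returns the byte count. */
static size_t encode_utf8_bmp(uint32_t cp, unsigned char out[3])
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Encodes a code point up to U+FFFF as one UTF-16 code unit (2 bytes). */
static size_t encode_utf16_bmp(uint32_t cp, int big_endian, unsigned char out[2])
{
    assert(cp <= 0xFFFF);
    unsigned char hi = (unsigned char)(cp >> 8);
    unsigned char lo = (unsigned char)(cp & 0xFF);
    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
    return 2;
}
```

For U+FEFF this gives EF BB BF in UTF-8, FE FF in big-endian UTF-16, and FF FE in little-endian UTF-16: three byte sequences, one character.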

I did this and clearly saw two 0xEF, 0xBB, 0xBF sequences. So I don't understand why stringByReplacingOccurrencesOfString didn't work for me. I tried to log bomString, but these characters are probably invisible. Can you give an example of working Cocoa code? TIA

Nava Carmon 2010-01-17 04:17:22

\*forehead-smack\* I just realized why it doesn't work: the UTF-8 data you passed starts with the UTF-8 BOM! Cocoa will, naturally, strip off (and use, if appropriate) a BOM that appears at the start of the input data. You don't have any other characters after the BOM, so `bomString` ends up with no characters in it; you're asking to replace a string with no characters with a space, and it doesn't find "no characters" anywhere in the document string. So you need to create the BOM string differently. `[NSString stringWithFormat:@"%C", 0xFEFF]` works.

Peter Hosey 2010-01-17 05:10:45

Thanks a lot, it worked indeed! How could I have known this in the first place? Is there a doc where I can learn about it?

Nava Carmon 2010-01-17 07:04:12

Read the Unicode Standard: http://www.unicode.org/standard/standard.html

Peter Hosey 2010-01-17 16:51:15

NSXMLParser and BOM bytes