Hi,

I'm getting an XML file as the result of a PHP query to a server. When I print the resulting data to the console, I see a well-structured XML file. When I try to parse it using NSXMLParser, it fails with an NSXMLParserErrorDomain error, code 4 (empty document). I noticed that the XML documents it can't parse have a BOM (byte order mark) sequence right after the closing '>' of the XML header. The question is how to get rid of the BOM sequence. I tried to create a string from the BOM bytes like this:

    const UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason. There are also XML documents that have this sequence after the root element; in that case NSXMLParser parses them successfully. Safari ignores those characters, and so does the Xcode debugger. Please help!

Thanks,

Nava

A: 

I'm not certain that this is the issue. I've had a very similar experience where the file was encoded as UTF-8, but the XML header claimed it was UTF-16.

Because of the mismatch, I was unable to parse it and got the same error you had. Changing the XML header from UTF-16 to UTF-8 fixed the issue for me.

You may be experiencing a similar issue.

Bryan McLemore
The header says: `<?xml version="1.0" encoding="utf-8"?>`. If I save this XML to a file and open it with BBEdit, I see that it has UTF-8 encoding, no BOM. What I see in Resorcerer when I open this file, however, is the BOM sequence after the closing '>' of the header. My question was how do I get rid of this? and I see in BBEdit, that
Nava Carmon
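To check for the kind of mismatch described in this answer, one option is to look at the first bytes of the downloaded data and compare what they imply with the encoding named in the XML header. The following is only a sketch: `bom_kind` and `sniff_bom` are hypothetical names, it only recognizes a BOM at the very start of the buffer, and it does not distinguish UTF-32 (whose little-endian BOM also starts with FF FE).

```c
#include <stddef.h>

/* Hypothetical helper type: which BOM, if any, starts the buffer. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE } bom_kind;

/* Inspects only the first bytes of buf; never reads past len. */
bom_kind sniff_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;   /* no BOM: the encoding has to come from elsewhere */
}
```

If this reports `BOM_UTF8` while the header says `encoding="UTF-16"` (or the other way round), the document has the mismatch described above.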
A: 

Well, maybe this is not the best way to get rid of BOM bytes, but it works. For those who, like me, spent hours trying to make NSXMLParser swallow BOMs: suppose you get your data through NSURLConnection and store it in `NSMutableData *webData`.

    const char bom[3] = {0xEF, 0xBB, 0xBF};

    char *data = [webData mutableBytes];
    char *cp = data, *pp;
    long lessBom = 0;
    do {
        cp = strstr((const char *)cp, (const char *)bom);
        if (cp) {
            pp = cp;
            cp += 3;
            memcpy(pp, cp, strlen(cp));
            lessBom += 3;
        }
    } while (cp != NULL);

    NSMutableData *newData = [[NSMutableData alloc] initWithBytes:data length:webData.length - lessBom];

Then you create your parser with newData and it JUST WORKS! I'll be glad to get any comments on or improvements to this code.

Nava Carmon
Definitely do not use `strstr` here. That's for C strings, which are null-terminated (the last byte is 0). The contents of an NSMutableData are not null-terminated unless you do this yourself, and can contain null bytes, the first of which `strstr` and other C-string functions will treat as the terminator. NSData and NSMutableData have methods that can do the same job much more safely; see their documentation for details.
Peter Hosey
Thanks, I thought about that, though I can add a '\0' at the end. I'll refactor it anyway.
Nava Carmon
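Following up on the comment above, here is one possible length-aware version of the same idea in plain C. It is a sketch, not the code either poster ended up with: `strip_utf8_boms` is a hypothetical helper name, and it assumes the buffer really is UTF-8 (in valid UTF-8 the bytes EF BB BF can only be U+FEFF). It relies on `memcmp` and an explicit length instead of `strstr`/`strlen`, so zero bytes in the data and a missing terminator do not matter.

```c
#include <stddef.h>
#include <string.h>

/* Removes every UTF-8 BOM (EF BB BF) from buf[0..len) in place and
 * returns the new length. Bytes are compacted toward the front, so
 * no overlapping memcpy is needed. */
size_t strip_utf8_boms(unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    size_t in = 0, out = 0;

    while (in < len) {
        if (len - in >= 3 && memcmp(buf + in, bom, 3) == 0) {
            in += 3;                 /* skip the BOM bytes */
        } else {
            buf[out++] = buf[in++];  /* keep everything else */
        }
    }
    return out;
}
```

With the `webData` from the answer, this could be called as `strip_utf8_boms([webData mutableBytes], [webData length])`, followed by `[webData setLength:]` with the returned value.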
A: 

I tried to create a string with those BOM bytes like that:

    const UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason.

Make sure you gave the correct encoding when instantiating noBOMString. If the document data was UTF-8, make sure you instantiated the string as UTF-8. Likewise, if the data was UTF-16, make sure you instantiated the string as UTF-16.

If you pass the wrong encoding, either the string won't instantiate at all (I'm assuming that isn't your problem) or some characters will be wrong. The BOM would be one of these: If the input is UTF-8 and you interpret it as MacRoman or ISOLatin1, it'll appear in the string as three separate characters. These three separate characters won't compare equal to the single character that is the BOM.

Peter Hosey
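The point about the BOM turning into three characters can be illustrated by counting characters under each interpretation of the same bytes. This is a rough sketch with hypothetical function names; the UTF-8 counter assumes well-formed input and simply counts bytes that are not continuation bytes (`10xxxxxx`).

```c
#include <stddef.h>

/* In ISO Latin-1 every byte is exactly one character. */
size_t latin1_char_count(const unsigned char *s, size_t len)
{
    (void)s;
    return len;
}

/* In well-formed UTF-8, each character has exactly one byte that is
 * not a continuation byte, so counting those bytes counts characters. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the bytes EF BB BF, the Latin-1 count is 3 and the UTF-8 count is 1, which is why a string decoded with the wrong encoding never contains the single BOM character.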
Yes, theResult was instantiated with NSUTF8StringEncoding, so I suppose checking for the BOM as a 3-char string was right. The fact is that the C code works. AFAIK there are different BOM sequences for different encodings, so how would you suggest checking for it? Is it possible using Cocoa strings?
Nava Carmon
There is only one BOM: U+FEFF. It appears as different sequences of bytes in different encodings because different encodings encode the same characters as different bytes. Creating `bomString` from UTF-8 is one way, but it doesn't matter which UTF you create it from, because (as long as you give the right code units) it will always be U+FEFF in the resulting string. Your code should work just fine; you might try dumping `theResult` to a file before and after `stringByReplacing…` and viewing it with a hex editor such as Hex Fiend.
Peter Hosey
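To make the "one character, different bytes" point concrete, here is a small sketch that encodes a single code point in three encodings. The function names are hypothetical, and the UTF-8 encoder only handles code points below U+10000 outside the surrogate range, which is enough for U+FEFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes a code point below U+10000 as UTF-8; returns bytes written. */
size_t encode_utf8_bmp(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* UTF-16, big-endian: high byte first. */
void encode_utf16be(uint16_t cp, unsigned char out[2])
{
    out[0] = (unsigned char)(cp >> 8);
    out[1] = (unsigned char)(cp & 0xFF);
}

/* UTF-16, little-endian: low byte first. */
void encode_utf16le(uint16_t cp, unsigned char out[2])
{
    out[0] = (unsigned char)(cp & 0xFF);
    out[1] = (unsigned char)(cp >> 8);
}
```

Passing `0xFEFF` to these yields EF BB BF for UTF-8, FE FF for big-endian UTF-16, and FF FE for little-endian UTF-16: the same character each time, only the bytes differ.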
I did this and clearly saw 2 char sequences 0xEF, 0xBB, 0xBF. So I don't understand why stringByReplacingOccurrencesOfString didn't work for me. I tried to log bomString, but these characters are probably invisible. Can you give an example of working Cocoa code? TIA
Nava Carmon
\*forehead-smack\* I just realized why it doesn't work—because the UTF-8 data you passed starts with the UTF-8 BOM! Cocoa will, obviously, strip off (and use, if appropriate) a BOM that appears at the start of the input data. You don't have any other characters after the BOM; thus, you're replacing a string-with-no-characters with a space, and it doesn't find no characters anywhere in the document string. So, you need to create the BOM string differently. `[NSString stringWithFormat:@"%C", 0xFEFF]` works.
Peter Hosey
Thanks a lot, it worked indeed! How could I have known this in the first place? Is there a doc where I can learn about it?
Nava Carmon
Read the Unicode Standard: http://www.unicode.org/standard/standard.html
Peter Hosey