NSXMLParserInvalidCharacterError # 9

This is the error I get when the parser hits an unusual character (such as curly quotes copied from Word and pasted into a web form, which then end up in the feed). The feed I am using does not declare an encoding, and there is no hope of getting them to change that. This is all I get in the header:

<?xml version="1.0"?> <rss version="2.0">

What can I do about illegal characters when parsing feeds? Do I sweep the data prior to the parse? Is there something I am missing in the API? Has anyone dealt with this issue?

A: 

NSString *dataString = [[[NSString alloc] initWithData:webData encoding:NSASCIIStringEncoding] autorelease];

NSData *data = [dataString dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];

Fixed my problems...

Chris Van Buskirk
A: 

The NSString -initWithData:encoding: method returns nil if it fails, so you can try one encoding after another until you find one that converts. This doesn't guarantee that you'll convert all the characters correctly, but if your feed source isn't sending you correctly encoded XML, then you'll probably have to live with it.

The basic idea is:

// try the most likely encoding
NSString *xmlString = [[NSString alloc] initWithData:xmlData 
                                            encoding:NSUTF8StringEncoding];

if (xmlString == nil) {
  // try the next likely encoding
  xmlString = [[NSString alloc] initWithData:xmlData 
                                     encoding:NSWindowsCP1252StringEncoding];
}

if (xmlString == nil) {
  // etc...
}
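To see why the UTF-8 attempt comes back nil for a feed containing Word's curly quotes: Windows-1252 stores them as the single bytes 0x93 and 0x94, which can never start a UTF-8 sequence. The following is a minimal sketch of that kind of check in plain C (which also compiles as Objective-C). The function name is made up for illustration, and it makes no claim to match NSString's internal validation exactly.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer is structurally valid UTF-8,
   0 otherwise. Simplified: it checks lead bytes and continuation bytes but
   does not reject every overlong or surrogate encoding. */
static int is_valid_utf8(const unsigned char *p, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        size_t extra;                               /* continuation bytes expected */

        if (c < 0x80)                   extra = 0;  /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1; /* 2-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF) extra = 2; /* 3-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4) extra = 3; /* 4-byte sequence */
        else return 0;                  /* stray continuation or invalid lead byte */

        if (n - i - 1 < extra) return 0;            /* sequence is cut off */
        for (size_t k = 1; k <= extra; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return 0; /* must look like 10xxxxxx */
        }
        i += extra + 1;
    }
    return 1;
}
```

Feeding it the bytes of a Windows-1252 string with curly quotes returns 0, while the same text encoded as UTF-8 (where each curly quote is a three-byte sequence starting with 0xE2) returns 1.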

To be generic and robust, you could do the following until successful:

1.) Try the encoding specified in the Content-Type header of the HTTP response (if any)

2.) Check the start of the response data for a byte order mark and if found, try the indicated encoding

3.) Look at the first two bytes; if you find a whitespace character or '<' paired with a nul/zero character, try UTF-16 (similarly, you can check the first four bytes to see if you have UTF-32)

4.) Scan the start of the data looking for the <?xml ... ?> processing instruction and look for encoding='something' inside it; try that encoding.

5.) Try some common encodings. Definitely check Windows Latin-1, Mac Roman, and ISO Latin-1 if your data source is in English.

6.) If none of the above work, you could try removing all bytes greater than 127 (or substitute '?' or another ASCII character) and convert the data using the ASCII encoding.
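Steps 2, 3 and 6 work on the raw bytes, so they can be sketched in plain C (which also compiles as Objective-C) before any NSString is involved. This is only a rough sketch: the EncodingGuess labels and function names are invented for illustration, and you would map the result to the matching NSStringEncoding yourself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for this sketch. */
typedef enum {
    GUESS_UNKNOWN, GUESS_UTF8,
    GUESS_UTF16_BE, GUESS_UTF16_LE,
    GUESS_UTF32_BE, GUESS_UTF32_LE
} EncodingGuess;

static int starts_with(const unsigned char *p, size_t n,
                       const unsigned char *mark, size_t m) {
    return n >= m && memcmp(p, mark, m) == 0;
}

/* Step 2: look for a byte order mark. UTF-32 is tested before UTF-16
   because the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark. */
static EncodingGuess guess_from_bom(const unsigned char *p, size_t n) {
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16be[] = {0xFE, 0xFF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};

    if (starts_with(p, n, utf32be, 4)) return GUESS_UTF32_BE;
    if (starts_with(p, n, utf32le, 4)) return GUESS_UTF32_LE;
    if (starts_with(p, n, utf8, 3))    return GUESS_UTF8;
    if (starts_with(p, n, utf16be, 2)) return GUESS_UTF16_BE;
    if (starts_with(p, n, utf16le, 2)) return GUESS_UTF16_LE;
    return GUESS_UNKNOWN;
}

/* '<' or ASCII whitespace: what an XML document is expected to start with. */
static int is_xml_start(unsigned char c) {
    return c == '<' || c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Step 3: no BOM, so look for '<' or whitespace padded with zero bytes.
   The four-byte patterns are checked first for the same reason as above. */
static EncodingGuess guess_from_nul_pattern(const unsigned char *p, size_t n) {
    if (n >= 4) {
        if (!p[0] && !p[1] && !p[2] && is_xml_start(p[3])) return GUESS_UTF32_BE;
        if (is_xml_start(p[0]) && !p[1] && !p[2] && !p[3]) return GUESS_UTF32_LE;
    }
    if (n >= 2) {
        if (!p[0] && is_xml_start(p[1])) return GUESS_UTF16_BE;
        if (is_xml_start(p[0]) && !p[1]) return GUESS_UTF16_LE;
    }
    return GUESS_UNKNOWN;
}

/* Step 6: last resort. Overwrite every byte above 127 in place with an
   ASCII substitute and return how many bytes were replaced. */
static size_t replace_non_ascii(unsigned char *p, size_t n,
                                unsigned char substitute) {
    size_t replaced = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] > 127) {
            p[i] = substitute;
            replaced++;
        }
    }
    return replaced;
}
```

You would call guess_from_bom() first, fall back to guess_from_nul_pattern() if it returns GUESS_UNKNOWN, and only reach for replace_non_ascii() (on a mutable copy of the data) once every real encoding has failed, since it throws information away.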

If you don't have an NSString by this point, you should fail. If you do have an NSString, you should look for the encoding declaration in the <?xml ... ?> processing instruction (if you didn't already in step 4). If it's there, you should convert the NSString back to NSData using that encoding; if it's not there, you should convert back using UTF-8 encoding.
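Finding the encoding declaration can be done on the raw bytes as long as the data is ASCII-compatible at the start (UTF-8, Latin-1, Windows-1252 and similar). Here is a rough sketch in plain C, which also compiles as Objective-C. The function name is made up for illustration, and it assumes the document begins directly with "<?xml", with no BOM or leading whitespace.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the encoding name declared in the leading
   <?xml ... ?> into out (NUL-terminated) and returns 1. Returns 0 if there
   is no declaration, no encoding attribute, or out is too small. */
static int find_declared_encoding(const char *data, size_t len,
                                  char *out, size_t out_size) {
    static const char decl_open[] = "<?xml";
    static const char key[] = "encoding";
    const size_t key_len = sizeof key - 1;
    size_t i = sizeof decl_open - 1;
    size_t end;

    if (len < i || memcmp(data, decl_open, i) != 0) return 0;

    /* The declaration ends at the first "?>". */
    for (end = i; end + 1 < len; end++) {
        if (data[end] == '?' && data[end + 1] == '>') break;
    }
    if (end + 1 >= len) return 0;                 /* never closed */

    for (; i + key_len <= end; i++) {
        if (memcmp(data + i, key, key_len) != 0) continue;

        size_t j = i + key_len;
        while (j < end && (data[j] == ' ' || data[j] == '\t')) j++;
        if (j >= end || data[j] != '=') continue; /* not an attribute */
        j++;
        while (j < end && (data[j] == ' ' || data[j] == '\t')) j++;
        if (j >= end || (data[j] != '"' && data[j] != '\'')) continue;

        char quote = data[j++];                   /* value may use ' or " */
        size_t start = j;
        while (j < end && data[j] != quote) j++;
        if (j >= end) return 0;                   /* unterminated value */

        size_t n = j - start;
        if (n == 0 || n >= out_size) return 0;
        memcpy(out, data + start, n);
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

For the header in the question, which has only version="1.0", this returns 0, so you would convert back to NSData using UTF-8. When it does return a name, that name still has to be mapped to an NSStringEncoding before you can use it.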

Also, the CFStringConvertIANACharSetNameToEncoding() and CFStringConvertEncodingToNSStringEncoding() functions can help you get the NSStringEncoding that goes with the encoding name from the Content-Type header or the <?xml ... ?> processing instruction.

Don McCaughey