I am using NSXMLParser and am wondering how I can parse ISO-8859-1 correctly into an NSString.

Currently, I am getting results with Â for two-byte characters.

The XML I'm using (not created by me) starts with <?xml version="1.0" encoding="ISO-8859-1"?>

Here are the basic calls I'm using (NSThread calls omitted).

NSString *xmlFilePath = [[NSBundle mainBundle] pathForResource:sampleFileName ofType:@"xml"];

NSString *xmlFileContents = [NSString stringWithContentsOfFile:xmlFilePath encoding:NSUTF8StringEncoding error:nil];

NSData *data = [xmlFileContents dataUsingEncoding:NSUTF8StringEncoding];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];

[parser setDelegate:self];

[parser parse];
+1  A: 

The XML specification recommends an explicit character encoding declaration in the document prolog. Your input document likely has one; that will tell you the encoding that the parser must use to interpret the character input.

In the absence of an explicit declaration, the same section says to treat the input as UTF-8 or UTF-16 (and it's an error if the document turns out not to be encoded as either of those).

So, if your XML parser is either ignoring the explicit encoding declaration, or using the wrong encoding if there's no explicit declaration, your parser is Doing It Wrong™ and needs to be fixed to conform to the XML specification.
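One way to check whether the parser honors the declaration is to feed it bytes that really are in the declared encoding. The following is a minimal sketch, assuming ARC and a current Foundation; the `TextCollector` class and the in-memory sample document are hypothetical stand-ins, not part of the question's code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that just accumulates character data.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

int main(void) {
    @autoreleasepool {
        // Sample document (an assumption for illustration) containing a pound sign.
        NSString *xml = @"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                        @"<price>\u00A35</price>";

        // Encode the bytes as ISO-8859-1 so they match the declaration.
        NSData *latin1Bytes = [xml dataUsingEncoding:NSISOLatin1StringEncoding];

        NSXMLParser *parser = [[NSXMLParser alloc] initWithData:latin1Bytes];
        TextCollector *collector = [[TextCollector alloc] init];
        parser.delegate = collector;

        if ([parser parse]) {
            NSLog(@"Parsed text: %@", collector.text);
        } else {
            NSLog(@"Parse failed: %@", parser.parserError);
        }
    }
    return 0;
}
```

In the question's code the file is first decoded and then re-encoded with `NSUTF8StringEncoding`, so the parser always receives UTF-8 bytes while the declaration still says ISO-8859-1. Handing over the file's bytes untouched, for example with `[NSData dataWithContentsOfFile:xmlFilePath]`, avoids that mismatch.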

bignose
Ah, OK. That makes sense. Sorry, I'm a bit new at this. So at the top of my XML document is the line <?xml version="1.0" encoding="ISO-8859-1"?>. That's the encoding, right? So I have to tell NSXMLParser this?
Travis
Note that the XML spec doesn't require the parser to understand anything other than UTF-8 and UTF-16 (section 2.2). I have never used the XML parser in question, so I don't know for sure, but it could be the case that NSXMLParser doesn't support anything beyond that.
Michael Madsen
A: 

It looks like your header claims ISO-8859-1, but from the behavior (ending up with two characters instead of one) it sounds like at least some of your content is already UTF-8. This looks like a classic "double UTF-8 encoding" issue, where content already encoded as UTF-8 is treated as a single-byte encoding and encoded again as UTF-8. Change the header to say UTF-8 and it just might start working. You could also try parsing the content as UTF-8 first and fall back to the declared encoding only if that fails, since content that is not valid UTF-8 will give you a parser error.
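To see where the stray Â comes from, here is a small sketch (assuming ARC; the pound sign is just an illustrative character, not taken from the question's file) that encodes one character as UTF-8 and then decodes those bytes as ISO-8859-1.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // U+00A3 (pound sign) is one character but two bytes in UTF-8: C2 A3.
        NSString *original = @"\u00A3";
        NSData *utf8Bytes = [original dataUsingEncoding:NSUTF8StringEncoding];

        // Decoding those bytes as ISO-8859-1 maps each byte to its own character:
        // 0xC2 becomes Â and 0xA3 becomes the pound sign.
        NSString *misread = [[NSString alloc] initWithData:utf8Bytes
                                                  encoding:NSISOLatin1StringEncoding];

        NSLog(@"UTF-8 byte count: %lu", (unsigned long)utf8Bytes.length);
        NSLog(@"Misread as ISO-8859-1: %@ (length %lu)",
              misread, (unsigned long)misread.length);
    }
    return 0;
}
```

The misread string has two characters, the first of which is Â, which matches the symptom described in the question.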

Finally, note that the encoding declared in an XML file is overridden by the charset in the HTTP Content-Type header if the file is served over HTTP.
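When the file is fetched with Foundation's URL loading classes, the charset from the HTTP header is available as `textEncodingName` on `NSURLResponse`. The sketch below, assuming ARC, shows one way to turn such a charset name into an `NSStringEncoding`; `EncodingForCharsetName` is a hypothetical helper name, and the charset string is hard-coded here rather than read from a real response.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: map an IANA charset name (as found in a Content-Type
// header) to an NSStringEncoding, returning `fallback` if it is unknown.
static NSStringEncoding EncodingForCharsetName(NSString *name,
                                               NSStringEncoding fallback) {
    if (name.length == 0) {
        return fallback;
    }
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return fallback;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

int main(void) {
    @autoreleasepool {
        // In real code this would come from response.textEncodingName.
        NSString *charset = @"ISO-8859-1";
        NSStringEncoding encoding =
            EncodingForCharsetName(charset, NSUTF8StringEncoding);

        NSLog(@"Matches NSISOLatin1StringEncoding: %d",
              encoding == NSISOLatin1StringEncoding);
    }
    return 0;
}
```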

Not sure if it applies to your need, but I love this article on parsing XML at all costs. As an example, I'll point out that I also love feedparser (Python), as it's the best parse-at-all-costs XML parser I know of (great for ideas, but not for your situation).

Epsilon Prime
Great info, thank you. So if I have an HTTP link to an XML file, what is the easy way to download that file locally so I can look at it without HTTP modifying it? I tried in Safari but haven't found a way yet.
Travis
For debugging purposes I tend to use either `curl` or `wget` on the command line and tell them to show me the headers. In a browser I'll use Firefox along with an extension like Firebug to show the headers. For viewing the content in the browser I just right-click on it and select "View Source".
Epsilon Prime