I am pulling data from a website via NSURLConnection and stashing the received data in an instance of NSMutableData. In the connectionDidFinishLoading: delegate method, the data is converted into a string with a call to the appropriate NSString method:

NSString *result = [[NSString alloc] initWithData:data
                                     encoding:NSUTF8StringEncoding];

The resulting string turns out to be nil. If I use NSASCIIStringEncoding, however, I do get a string, albeit with the non-ASCII characters garbled, as expected. The server's Content-Type header does not specify UTF-8, but I have tried a number of other websites that also omit it, and there the string conversion works just fine. The problem seems to be specific to this one web service, but I have no clue why.

On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?

Much appreciated!

+2  A: 

The data might be in another Unicode encoding, such as UTF-16, or in a totally different encoding.

There are libraries that can guess the encoding of a piece of data, but that should be a last resort. If you're using a web service, it should have documentation that says which encoding it uses. Look for that, or ask the provider. If neither is available, get some sample data, determine its encoding, and use that encoding in the program.
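
For reference, newer Foundation SDKs (macOS 10.10 / iOS 8 and later) ship a guessing API of this kind. The following is only a sketch of the last-resort approach: the helper name GuessAndDecode is hypothetical, limiting the guess to UTF-8 and ISO 8859-1 is an assumption rather than anything the web service documents, and ARC is assumed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: ask Foundation to guess the encoding of `data`.
// Returns the decoded string, or nil if no candidate encoding fits.
static NSString *GuessAndDecode(NSData *data, NSStringEncoding *guessed) {
    // Assumption: only these two encodings are plausible for this service.
    NSDictionary *options = @{
        NSStringEncodingDetectionSuggestedEncodingsKey:
            @[ @(NSUTF8StringEncoding), @(NSISOLatin1StringEncoding) ],
        NSStringEncodingDetectionUseOnlySuggestedEncodingsKey: @YES
    };
    NSString *converted = nil;
    NSStringEncoding encoding =
        [NSString stringEncodingForData:data
                        encodingOptions:options
                        convertedString:&converted
                    usedLossyConversion:NULL];
    if (guessed != NULL) {
        *guessed = encoding;  // 0 means detection failed
    }
    return (encoding == 0) ? nil : converted;
}
```

A guess like this can still be wrong, so documentation from the provider remains the better source.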

On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?

That depends on the size of the data. If it's small, that would be perfectly fine. If it's big, it would be better to deal with the data piecemeal.

Yuji
It is definitely UTF-8. It's almost like a certain character is causing it to freak out.
mitjak
Could you post the exact string which causes the problem? Maybe it's malformed, etc.
Yuji
This is so strange. It has started working fine now. I found another site it failed on, http://hypem.com, but that now works fine too. I want to blame the simulator or my network somehow, but I honestly don't know. In general, though, what could cause such an error if it's not my device? Could a network failure produce it, or would one of the error delegate methods be called in that case? Thank you for sticking around to answer!
mitjak
I guess the data from the website is itself sometimes corrupt, for example because the server failed to convert it to UTF-8 in the first place. Encoding problems are very dear to me, coming from Japan, where three encodings were competing with each other. The gradual adoption of UTF-8, although not perfect, is a real blessing to me.
Yuji
+2  A: 

You say that it “is definitely UTF-8”, but without a Content-Type header, you don't really know that. (And even if you did have a header saying that, it could still be wrong.)

My guess is that your data is usually ASCII, which always parses correctly as UTF-8, but you sometimes are trying to parse data that's actually encoded in ISO 8859-1 or Windows codepage 1252. Such data will generally be mostly ASCII, but with some bytes outside the 0–127 range ASCII defines. UTF-8 would expect such bytes to form a sequence of code units within a specified sequence of ranges, but in other encodings, any byte, regardless of value, is a complete character on its own. Trying to interpret non-ASCII non-UTF-8 data as UTF-8 will almost always get you either wrong results (wrong characters) or no results at all (cannot decode; decoder returns nil), because the data was never encoded in UTF-8 in the first place.

You should try UTF-8 first, and if it fails, use ISO 8859-1. If you're letting the user retrieve any web page, you should let them change the encoding you use to decode the data, in case they discover that it was actually 8859-9 or codepage-1252 or some other 8-bit encoding.
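
A minimal sketch of that fallback, assuming ARC. The helper name DecodeTextData is hypothetical, and the choice of ISO 8859-1 as the only fallback is the assumption described above, not a guarantee about any particular server.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: try UTF-8 first, then fall back to ISO 8859-1.
// ISO 8859-1 assigns a character to every byte value, so the fallback
// yields a string, though possibly with the wrong characters.
static NSString *DecodeTextData(NSData *data, NSStringEncoding *usedEncoding) {
    NSStringEncoding encoding = NSUTF8StringEncoding;
    NSString *text = [[NSString alloc] initWithData:data encoding:encoding];
    if (text == nil) {
        // The bytes were not valid UTF-8; reinterpret them as ISO 8859-1.
        encoding = NSISOLatin1StringEncoding;
        text = [[NSString alloc] initWithData:data encoding:encoding];
    }
    if (usedEncoding != NULL) {
        *usedEncoding = encoding;  // lets the caller show or override the choice
    }
    return text;
}
```

In connectionDidFinishLoading: this would replace the direct initWithData:encoding: call. Reporting the encoding that was used makes it easier to offer the user a way to pick a different one.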

If you're downloading the data from a specific server, and especially if you have influence on what runs on that server, you should make it serve up an accurate Content-Type header and/or fix whatever bug is causing it to serve up text that isn't in UTF-8.

Peter Hosey
Well said. Good advice from a sage.
Yuji
This is probably the most complete answer. For the benefit of those who follow in my steps googling for this question, I'm marking it as the accepted answer :). To sum up, decoding as UTF-8 and falling back to other encodings seems like the best bet in case something goes wrong.
mitjak
+2  A: 

The default encoding for HTTP, if none is specified, is ISO-8859-1. If the HTTP response is compliant with HTTP/1.1 and does not specify a character set, that is the encoding it is using.

Try decoding the data with NSISOLatin1StringEncoding.
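
One way to apply this is to honor the charset the response declares and use ISO-8859-1 only when none is given. The sketch below assumes ARC; the helper name EncodingForCharsetName is hypothetical, and treating an unrecognized charset name the same as a missing one is an assumption.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: map a charset name such as "utf-8" (for example the
// value of -[NSURLResponse textEncodingName]) to an NSStringEncoding.
// A missing or unrecognized name falls back to ISO-8859-1.
static NSStringEncoding EncodingForCharsetName(NSString *charsetName) {
    if (charsetName.length > 0) {
        CFStringEncoding cfEncoding =
            CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
        if (cfEncoding != kCFStringEncodingInvalidId) {
            return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
        }
    }
    return NSISOLatin1StringEncoding;
}
```

The response passed to connection:didReceiveResponse: can supply the name, so the result of EncodingForCharsetName([response textEncodingName]) can be stored there and used when the data is finally decoded.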

JeremyP