Hi,

When I fetch the source of any web page, no matter which encoding I use, I always end up with &#-style character references (such as `&#169;` for © or `&#174;` for ®) instead of the actual characters themselves. The same goes for non-English characters (such as åäö in Swedish), which I have to parse out of `&Aring;` and the like.

I'm using

+stringWithContentsOfURL:encoding:error:

to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.

Any ideas, tips, or solutions are greatly appreciated! I'd rather not have to implement the entire ASCII table and replace all occurrences of every character... Thanks in advance!

Regards

A: 

Are you sure they are not in `&Aring;` form in the original source? Try viewing the source code in a browser first.

KennyTM
The web page looks fine, but I have to believe there is a better way than this: http://stackoverflow.com/questions/659602/objective-c-html-escape-unescape
To clarify, the web page source displays &#-characters, but I want their equivalent in the NSString (as displayed in a web browser).
@user: If they are originally in `&Aring;` form and you want to convert them into `Å` then no, there's nothing better than that.
KennyTM
A: 

That really, really sucks. I wanted to convert it directly and the above solution isn't really a good one, so I just wrote my own ascii-table converter (static) class. Works as it should have worked natively (though I have to fill in the ascii table myself...)

Ideas for optimization? ("ASCII" is a static NSDictionary)

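#import <Foundation/Foundation.h>

// Lazily loaded table mapping entity references (keys) to replacement characters (values).
static NSDictionary *ASCII = nil;

@implementation InternetHelper

// Fetches the page at the given URL string as UTF-8 text and, if requested,
// replaces every entity reference listed in the table.
+(NSString *)HTMLSourceFromUrlWithString:(NSString *)str convertASCII:(BOOL)state
{
    NSURL *url = [NSURL URLWithString:str];
    NSString *source = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];

    // source is nil if the fetch or the decoding failed; skip the conversion in that case.
    if (state && source)
        source = [InternetHelper ConvertASCIICharactersInString:source];

    return source;
}

+(NSString *)ConvertASCIICharactersInString:(NSString *)str
{
    // stringWithString: raises an exception when passed nil.
    if (!str)
        return nil;

    NSString *ret = [NSString stringWithString:str];

    if (!ASCII)
    {
        // kASCIICharacterTableFilename and kFileFormat are constants defined elsewhere in the project.
        NSString *path = [[NSBundle mainBundle] pathForResource:kASCIICharacterTableFilename ofType:kFileFormat];
        ASCII = [[NSDictionary alloc] initWithContentsOfFile:path];
    }

    // One pass over the whole string per table entry.
    for (id key in ASCII)
    {
        ret = [ret stringByReplacingOccurrencesOfString:key withString:[ASCII objectForKey:key]];
    }

    return ret;
}

@end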
ASCII does not mean what you seem to think it means. It is an encoding (and a very small one at that); it has nothing to do with SGML or XML entity references. Moreover, there is a simpler, easier way to do this; see my answer.
Peter Hosey
A: 

I'm using

+stringWithContentsOfURL:encoding:error:

to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.

You're misunderstanding the purpose of that encoding: argument. The method needs to convert bytes into characters somehow; the encoding tells it what sequences of bytes describe which characters. You need to make sure the encoding matches that of the resource data.

Entity references belong to SGML and XML, which are not encodings; they are markup language syntaxes. stringWithContentsOfURL:encoding:error: and its cousins do not attempt to parse the decoded characters (syntax) in any way, and parsing is exactly what it would take to convert one sequence of characters (an entity reference such as `&Aring;`) into a different one (the entity that is referenced, in practice a single character).
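
To make "parsing" concrete, here is a minimal sketch of what decoding just the numeric references (`&#169;`, `&#xE5;`) involves. It is plain C, so it should also compile unchanged inside an Objective-C file. The function names `digit_value`, `utf8_encode` and `decode_numeric_refs` are made up for this illustration, the output is assumed to be UTF-8, and named references such as `&Aring;` are deliberately left untouched.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value of one digit in the given base (10 or 16), or -1 if c is not a digit. */
static int digit_value(char c, int base)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Writes code point cp as UTF-8 into out and returns the byte count.
   Returns 0 (and writes nothing) for 0, surrogates and out-of-range values. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns a malloc'ed copy of `in` with decimal and hexadecimal character
   references replaced by UTF-8 bytes. Anything that is not a well-formed
   numeric reference is copied through unchanged. The caller frees the result. */
char *decode_numeric_refs(const char *in)
{
    /* A reference is always at least as long as the bytes it decodes to,
       so the output never needs more room than the input. */
    char *out = malloc(strlen(in) + 1);
    if (out == NULL) return NULL;

    char *w = out;
    while (*in != '\0') {
        if (in[0] == '&' && in[1] == '#') {
            const char *p = in + 2;
            int base = 10;
            if (*p == 'x' || *p == 'X') { base = 16; p++; }

            unsigned long cp = 0;
            int digits = 0;
            int v;
            while ((v = digit_value(*p, base)) >= 0) {
                /* Stop accumulating once the value is out of range (avoids overflow). */
                if (cp <= 0x10FFFF) cp = cp * (unsigned long)base + (unsigned long)v;
                p++;
                digits++;
            }

            size_t n = (digits > 0 && *p == ';') ? utf8_encode(cp, w) : 0;
            if (n > 0) {
                w += n;
                in = p + 1;   /* skip past the terminating ';' */
                continue;
            }
        }
        *w++ = *in++;
    }
    *w = '\0';
    return out;
}
```

Calling `decode_numeric_refs` on a fetched page (as UTF-8 bytes) leaves ordinary text alone and rewrites only the numeric references. Even this small subset needs a scanner and an encoder, which is why a string-loading method does not do it for you.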

You can convert the entity references to un-escaped characters using the CFXMLCreateStringByUnescapingEntities function. It takes a CFString and returns a CFString; because CFString and NSString are toll-free bridged, you can pass your NSString in and treat the result as an NSString.
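
A minimal sketch of that call, assuming ARC. The helper name `UnescapedString` and the sample dictionary are made up for this illustration. As far as I know, passing NULL for the entities dictionary covers only the five standard XML entities, so HTML names such as `&Aring;` would need to be supplied in a dictionary keyed by entity name; treat that behavior as an assumption to check against the documentation.

```objc
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Hypothetical helper: returns a copy of `source` with entity references unescaped.
// `entities` maps entity names (without '&' and ';') to replacement strings; may be nil.
static NSString *UnescapedString(NSString *source, NSDictionary *entities)
{
    if (source == nil)
        return nil;

    CFStringRef unescaped = CFXMLCreateStringByUnescapingEntities(
        kCFAllocatorDefault,
        (__bridge CFStringRef)source,
        (__bridge CFDictionaryRef)entities);

    // "Create" functions return an owned object; hand that ownership to ARC.
    // Without ARC, use [(NSString *)unescaped autorelease] instead.
    return CFBridgingRelease(unescaped);
}

int main(void)
{
    @autoreleasepool {
        // Sample table only. "amp" is listed explicitly in case supplying a
        // dictionary replaces the built-in set (an assumption, not verified).
        NSDictionary *entities = @{ @"amp": @"&", @"Aring": @"Å", @"copy": @"©" };
        NSString *html = @"&copy; 2010 &Aring;kesson &amp; Co";
        NSLog(@"%@", UnescapedString(html, entities));
    }
    return 0;
}
```

Compared with the replace-every-key loop above, this hands the scanning to a single library call, and the table only needs entity names rather than full `&name;` strings.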

Peter Hosey
Thanks, I'll check that out.