views:

6395

answers:

8

First of all, I found this: http://stackoverflow.com/questions/659602/objective-c-html-escape-unescape, but it doesn't work for me.

My encoded characters (come from a RSS feed, btw) look like this: &

I searched all over the net and found related discussions, but no fix for my particular encoding, I think they are called hexadecimal characters.

Thanks.

+10  A: 

Those are called Character Entity References. When they take the form of &#<number>; they are called numeric entity references. Basically, it's a string representation of the byte that should be substituted. In the case of &#038;, it represents the character with the value of 38 in the ISO-8859-1 character encoding scheme, which is &.

The reason the ampersand has to be encoded in RSS is it's a reserved special character.

What you need to do is parse the string and replace the entities with a byte matching the value between &# and ;. I don't know of any great ways to do this in objective C, but this stack overflow question might be of some help.

Matt Bridges
+1 I was just about to submit the exact same answer (including the same links, no less!)
e.James
“Basically, it's a string representation of the byte that should be substituted.” More like character. This is text, not data; upon converting the text to data, the character may occupy multiple bytes, depending on the character and the encoding.
Peter Hosey
treznik
Matt Bridges
+9  A: 

I ought to post this on GitHub or something. This goes in a category of NSString, uses NSScanner for the implementation, and handles both hex and decimal numeric character entities as well as the usual symbolic ones.

Also, it handles malformed strings (when you have an & followed by an invalid sequence of characters) relatively gracefully, which turned out to be crucial in my released app that uses this code.

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                [result appendFormat:@"%C", charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Daniel Dickison
Very useful piece of code, however it does have a couple of issues that were addressed by Walty. Thanks for sharing!
Michael Waterfall
+6  A: 

The one by Daniel is basically very nice, and I fixed a few issues there:

  1. removed the skipping character for NSSCanner (otherwise spaces between two continuous entities would be ignored

    [scanner setCharactersToBeSkipped:nil];

  2. fixed the parsing when there are isolated '&' symbols (I am not sure what is the 'correct' output for this, I just compared it against firefox):

e.g.

    &#ABC DF & B&#39;  & C&#39; Items (288)

here is the modified code:

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", charCode];

       [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

       [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


       [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
      NSString *amp;

      [scanner scanString:@"&" intoString:&amp]; //an isolated & symbol
      [result appendString:amp];

      /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
       */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Walty
Excellent, works perfectly! Nice changes!
Michael Waterfall
+1  A: 

This is the way I do it using RegexKitLite framework:

-(NSString*) decodeHtmlUnicodeCharacters: (NSString*) html {
NSString* result = [html copy];
NSArray* matches = [result arrayOfCaptureComponentsMatchedByRegex: @"\\&#([\\d]+);"];

if (![matches count]) 
    return result;

for (int i=0; i<[matches count]; i++) {
    NSArray* array = [matches objectAtIndex: i];
    NSString* charCode = [array objectAtIndex: 1];
    int code = [charCode intValue];
    NSString* character = [NSString stringWithFormat:@"%C", code];
    result = [result stringByReplacingOccurrencesOfString: [array objectAtIndex: 0]
                                               withString: character];      
}   
return result;  

}

Hope this will help someone.

realsugar
+1  A: 

Hi guys you can use just this function to solve this problem.

+ (NSString*) decodeHtmlUnicodeCharactersToString:(NSString*)str
{
    NSMutableString* string = [[NSMutableString alloc] initWithString:str];  // #&39; replace with '
    NSString* unicodeStr = nil;
    NSString* replaceStr = nil;
    int counter = -1;

    for(int i = 0; i < [string length]; ++i)
    {
        unichar char1 = [string characterAtIndex:i];    
        for (int k = i + 1; k < [string length] - 1; ++k)
        {
            unichar char2 = [string characterAtIndex:k];    

            if (char1 == '&'  && char2 == '#' ) 
            {   
                ++counter;
                unicodeStr = [string substringWithRange:NSMakeRange(i + 2 , 2)];    
                // read integer value i.e, 39
                replaceStr = [string substringWithRange:NSMakeRange (i, 5)];     //     #&39;
                [string replaceCharactersInRange: [string rangeOfString:replaceStr] withString:[NSString stringWithFormat:@"%c",[unicodeStr intValue]]];
                break;
            }
        }
    }
    [string autorelease];

    if (counter > 1)
        return  [self decodeHtmlUnicodeCharactersToString:string]; 
    else
        return string;
}
Krishna Gupta
+10  A: 

Check out my NSString category for HTML. Here are the methods available:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
Michael Waterfall
Dude, excellent functions. Your stringByDecodingXMLEntities method made my day! Thanks!
bmoeskau
No problem ;) Glad you found it useful!
Michael Waterfall
Kudos! Great Additions!
Prakash
This is amazing. stringByDecodingXMLEntities is a real timesaver. Thanks!
Mike A
A: 
Fernando Redondo
Fernando Redondo