A: 

You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.

Griffo
Yes, I was just thinking about this, but I can't really consider it a real solution, because there is the method parser:resolveExternalEntityName:systemID:, for which the documentation says: "The delegate can resolve the external entity (for example, locating and reading an externally declared DTD) and provide the result to the parser object as an NSData object." So there should be a way to use it to resolve the entity and translate it for the parser. I'm probably missing something in the logic of NSXMLParser.
Roberto
You could try NSXMLDocument.
Griffo
But I'm reading that NSXMLDocument is not available for iPhone development. Is that true?
Roberto
An NSXMLDocument-style API is available in TouchXML (as CXMLDocument). See here: http://code.google.com/p/touchcode/wiki/TouchXML
sfa
Thank you, I'll definitely try it. But I can't stop wondering what the correct way to handle this case is using only the SDK.
Roberto
A: 

I think you're going to run into another problem with this example: it isn't valid XML, which is what NSXMLParser expects.

The exact problem in the above is that the META, LI, HTML and BODY tags aren't closed, so the parser looks all the way through the rest of the document for their closing tags.

The only way around this that I know of, if you can't change the HTML, is to mirror it with the closing tags inserted.

James Raybould
Sorry, the HTML in the example is just the first part of the file. That's my fault. The file has every tag correctly closed.
Roberto
A: 

I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.

Kendall Helmstetter Gelner
I read that libxml2 has an HTML parser, but I could not find a tutorial, documentation or example for it, which is why I tried NSXMLParser first.
Roberto
A: 

After exploring several alternatives, it appears that NSXMLParser will not support entities other than the five predefined XML entities: &lt;, &gt;, &apos;, &quot; and &amp;.

The code below fails, resulting in an NSXMLParserUndeclaredEntityError.


// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent


NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys: 
                     [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
                     [NSString stringWithFormat:@"%C", 0x00E0], @"agrave", 
                     ...
                     ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}

Declaring the entities by prepending ENTITY declarations to the HTML document lets the parse succeed; however, the expanded entities are not passed back to parser:foundCharacters:, and the è and à characters are dropped.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
]>

In another experiment, I created a completely valid XML document with an internal DTD:

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
    <!ELEMENT author (#PCDATA)>
    <!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>

I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.

2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model: 
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before: 
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: < 
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: < 
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser is missing functionality here. What should happen is that NSXMLParser or its delegate stores the entity definitions and provides them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.
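
For what it's worth, here is a minimal sketch of that string-replacement pass in plain C (which also compiles inside an Objective-C file). The EntityMapping type, the kEntities table and the replace_entities function are hypothetical names of my own, and the table only covers the two entities from the example above; it assumes the document is UTF-8 and that every replacement is no longer than the entity it replaces. The predefined entities such as &amp; are deliberately left alone so the XML parser can still handle them.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup table: entity reference -> UTF-8 byte sequence.
   Extend it from xhtml-lat1.ent as needed. */
typedef struct {
    const char *entity;
    const char *utf8;
} EntityMapping;

static const EntityMapping kEntities[] = {
    { "&egrave;", "\xC3\xA8" }, /* U+00E8 */
    { "&agrave;", "\xC3\xA0" }, /* U+00E0 */
};
static const size_t kEntityCount = sizeof kEntities / sizeof kEntities[0];

/* Returns a newly allocated copy of input with every mapped entity replaced
   by its UTF-8 bytes, or NULL if allocation fails. The caller frees it.
   Assumption: each replacement is no longer than its entity, so the output
   never needs more room than the input. */
char *replace_entities(const char *input)
{
    char *out = malloc(strlen(input) + 1);
    if (out == NULL) {
        return NULL;
    }

    const char *src = input;
    char *dst = out;
    while (*src != '\0') {
        int matched = 0;
        if (*src == '&') {
            for (size_t i = 0; i < kEntityCount; i++) {
                size_t entityLen = strlen(kEntities[i].entity);
                if (strncmp(src, kEntities[i].entity, entityLen) == 0) {
                    size_t utf8Len = strlen(kEntities[i].utf8);
                    memcpy(dst, kEntities[i].utf8, utf8Len);
                    dst += utf8Len;
                    src += entityLen;
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched) {
            *dst++ = *src++; /* copy everything else unchanged */
        }
    }
    *dst = '\0';
    return out;
}
```

The result could then be wrapped with [NSData dataWithBytes:length:] and handed to initWithData: as before. This scans the whole document once, so it should be fine for small pages, but I have not measured it on large ones.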

This has been fun.

falconcreek
I'll try this as soon as I'm back home! Thank you!
Roberto
:( This did not work. It continues to raise NSXMLParserUndeclaredEntityError = 26. :( I used your own code. It enters the resolveExternalEntityName method and then the error is raised.
Roberto
Can you include the URL? I have another theory that I would like to test.
falconcreek
You can find it here: https://dl.dropbox.com/u/1927123/testParseHtml.html
Roberto
Still looking for a solution. I found a possible answer at http://www.cocoabuilder.com/archive/cocoa/218098-nsxmlparser-and-character-entities.html; however, it uses NSAttributedString, which is not available on the current iPhone OS.
falconcreek
Ouch :(( In the meantime I tried TouchXML and read about other parsers, but it seems that this is a task you have to do on your own. :\
Roberto
Wow! Your answer is really complete! You really put everything into this, and I thank you. Great explanation. So the end of the story is that NSXMLParser sucks :)
Roberto
A: 

I am unsure as to what you are attempting to parse out of the HTML page, but this script may be useful.

http://www.biterscripting.com/helppages/SS_WebPageToText.html

I just tried it like this. I saved your sample HTML in file C:/test.html. Then, I started biterscripting and entered this command.

scr SS_WebpageToText.txt page("C:/test.html") 2>null

I got this output.

 morning something about you Bye Bye un saluto 

Is this what you are looking for? Try that SS_WebPageToText script. It's pretty simple.

P M