You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.
I think your going to run into another problem with this example as it isn't vaild XML which is what the NSXMLParser is looking for.
The exact problem in the above is that the tags META, LI, HTML and BODY aren't closed so the parser looks all the way though the rest of the document looking for its closing tag.
The only way around this that I know of if you don't have access to change the HTML is to mirror it with the closing tags inserted.
I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.
After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities <, >, ', " and &
The code below fails resulting in an NSXMLParserUndeclaredEntityError
.
// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys:
[NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
[NSString stringWithFormat:@"%C", 0x00E0], @"agrave",
...
,nil];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];
// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}
Attempts to declare the entities by prepending the HTML document with ENTITY declarations will pass, however the expanded entities are not passed back to parser:foundCharacters
and the è and à characters are dropped.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ENTITY agrave "à">
<!ENTITY egrave "è">
]>
In another experiment, I created a completely valid xml document with an internal DTD
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
<!ELEMENT author (#PCDATA)>
<!ENTITY js "Jo Smith">
]>
<author>< &js; ></author>
I implemented the parser:foundInternalEntityDeclarationWithName:value:;
delegate method and it is clear that the parser is getting the entity data, however the parser:foundCharacters
is only called for the pre-defined entities.
2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model:
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before:
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document
I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler
that is used by NSXMLParser
allows for a getEntity
callback to be defined. After calling getEntity
, the expansion of the entity is passed to the characters
callback.
NSXMLParser
is missing functionality here. What should happen is that the NSXMLParser
or its delegate
store the entity definitions and provide them to the xmlSAXHandler
getEntity
callback. This is clearly not happening. I will file a bug report.
In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml
parser on your own is worthwhile.
This has been fun.
I am unsure as to what you are attempting to parse out of the HTML page, but this script may be useful.
http://www.biterscripting.com/helppages/SS_WebPageToText.html
I just tried it like this. I saved your sample HTML in file C:/test.html. Then, I started biterscripting and entered this command.
scr SS_WebpageToText.txt page("C:/test.html") 2>null
I got this output.
morning something about you Bye Bye un saluto
Is this what you are looking for ? Try that SS_WebPageToText script. It's pretty simple.