Hi,

I'm trying to parse an XHTML document using TBXML on the iPhone (although I would be happy to use either libxml2 or NSXMLParser if it would be easier). I need to extract the content of the body as a series of paragraphs and maintain the inline tags, for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <head>
       <title>Title</title>
       <link rel="stylesheet" href="css/style.css" type="text/css"/>
       <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    </head>
    <body>
       <div class="body">
          <div>
             <h3>Title</h3>
             <p>Paragraph with <em>inline</em> tags</p>
             <img src="image.png" />
          </div>
       </div>
    </body>
</html>

I need to extract the paragraph but keep the <em>inline</em> content in place within it. In all my testing so far, the inline element has come out as a separate subelement, with no indication of where it sat in the paragraph text.

Can anyone suggest a way to do this?

Thanks.

+1  A: 

Assumption 1. You are only interested in the data in the p (paragraph) elements, and you are using NSXMLParser.

Assumption 2. You want to keep any element inside of p intact.

The strategy that you want to use is to create a state machine for your parser so that it knows when it needs to save data and when to ignore data as it is received.

Set up your NSXMLParser delegate using the sample code from Apple. Your delegate will need a BOOL ivar, inParagraph, to track whether incoming data is retained or discarded. The initial value of inParagraph is NO. When your delegate receives the parser:didStartElement:namespaceURI:qualifiedName:attributes: message, if ([elementName isEqual:@"p"]), clear your receivedData variable and set inParagraph = YES.

EDIT: receivedData is an NSMutableString. Fixed the code examples

At this point your parser delegate wants to save data received.

When the parser delegate receives the parser:foundCharacters: message, append the string to receivedData as in the sample code.

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (inParagraph) [receivedData appendString:string];
}

When the parser encounters the inline element, the delegate will receive parser:didStartElement:namespaceURI:qualifiedName:attributes: again. This is when the inParagraph state variable is important. The delegate is only given the element name, not the enclosing '<' and '>' characters, so you will have to wrap elementName in '<' and '>' yourself and append the result to receivedData. Something like

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qualifiedName attributes:(NSDictionary *)attributeDict
{
    if (inParagraph)
    {
        // Inline element inside a paragraph: write its opening tag back into the text.
        NSString *inlineElementName = [NSString stringWithFormat:@"<%@>", elementName];
        [receivedData appendString:inlineElementName];
    }
....
}
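
Note that the snippet above writes back only the element name, so any attributes on an inline element (for example href on an <a>) would be dropped. A minimal sketch of one way to keep them, assuming a hypothetical helper named OpeningTag (it is not part of NSXMLParser or of the code in this answer) that rebuilds the tag from the elementName and attributeDict arguments:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: rebuilds an opening tag such as <a class="c" href="x.html">
// from the values NSXMLParser passes to the didStartElement delegate method.
static NSString *OpeningTag(NSString *elementName,
                            NSDictionary<NSString *, NSString *> *attributeDict)
{
    NSMutableString *tag = [NSMutableString stringWithFormat:@"<%@", elementName];

    // Dictionaries are unordered, so sort the keys to get a stable result.
    NSArray<NSString *> *keys =
        [attributeDict.allKeys sortedArrayUsingSelector:@selector(compare:)];

    for (NSString *key in keys) {
        // The parser hands over unescaped values, so re-escape the characters
        // that would break a double-quoted attribute.
        NSString *value = attributeDict[key];
        value = [value stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
        value = [value stringByReplacingOccurrencesOfString:@"\"" withString:@"&quot;"];
        value = [value stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
        [tag appendFormat:@" %@=\"%@\"", key, value];
    }

    [tag appendString:@">"];
    return tag;
}
```

Inside the delegate method you could then append OpeningTag(elementName, attributeDict) to receivedData instead of the bare "<name>" string.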

When the parser delegate receives the parser:didEndElement:namespaceURI:qualifiedName: message, it checks whether it is inside a "p" element. If (inParagraph && ![elementName isEqual:@"p"]), an inline element has just ended, so append its closing tag. If ([elementName isEqual:@"p"]), the paragraph itself has ended, so add the contents of receivedData to the NSMutableArray holding your paragraphs and reset inParagraph.

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if (inParagraph)
    {
        if (![elementName isEqual:@"p"])
        {
            // End of an inline element: write its closing tag back into the text.
            NSString *inlineElementName = [NSString stringWithFormat:@"</%@>", elementName];
            [receivedData appendString:inlineElementName];
        }
        else
        {
            // Received the closing </p> tag: add a snapshot of receivedData to the
            // paragraph array (autorelease the copy if you are not using ARC).
            [paragraphsArray addObject:[receivedData copy]];
            inParagraph = NO;
        }
    }
}
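
Putting the three delegate methods together, the state machine might look like the sketch below. It assumes ARC and a modern SDK, and the names ParagraphCollector, paragraphs and ExtractParagraphs are made up for illustration; they do not come from Apple's sample code. Like the snippets above, it drops attributes on inline elements, and the text it collects is what the parser reports, so entities such as &amp; arrive already decoded.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical delegate class that collects the contents of every <p> element,
// keeping inline tags such as <em> in place.
@interface ParagraphCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, readonly) NSMutableArray<NSString *> *paragraphs;
@end

@implementation ParagraphCollector {
    BOOL _inParagraph;               // YES between <p> and </p>
    NSMutableString *_receivedData;  // text of the paragraph being built
}

- (instancetype)init {
    if ((self = [super init])) {
        _paragraphs = [NSMutableArray array];
        _receivedData = [NSMutableString string];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    if ([elementName isEqualToString:@"p"]) {
        // New paragraph: start from an empty buffer and begin saving data.
        [_receivedData setString:@""];
        _inParagraph = YES;
    } else if (_inParagraph) {
        // Inline element: write its opening tag back into the text.
        [_receivedData appendFormat:@"<%@>", elementName];
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // May be called several times for one run of text, so always append.
    if (_inParagraph) [_receivedData appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    if (!_inParagraph) return;
    if ([elementName isEqualToString:@"p"]) {
        // Paragraph finished: store an immutable snapshot and stop saving.
        [_paragraphs addObject:[_receivedData copy]];
        _inParagraph = NO;
    } else {
        [_receivedData appendFormat:@"</%@>", elementName];
    }
}

@end

// Hypothetical convenience wrapper: parses the data and returns the paragraphs.
static NSArray<NSString *> *ExtractParagraphs(NSData *xhtml) {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xhtml];
    ParagraphCollector *collector = [[ParagraphCollector alloc] init];
    parser.delegate = collector;  // the delegate is not retained by the parser
    [parser parse];
    return [collector.paragraphs copy];
}
```

Passing the body of the example document from the question through ExtractParagraphs should give a single entry, with the <em> tags sitting where they were in the source paragraph.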
falconcreek
That was exactly what I was looking for! Thanks!