There are a couple of different ways to remove HTML tags from an NSString in Cocoa.

One way is to render the string into an NSAttributedString and then grab the rendered text.

Another way is to use NSXMLDocument's -objectByApplyingXSLTString method to apply an XSLT transform that does it.

Unfortunately, the iPhone doesn't support NSAttributedString or NSXMLDocument. There are too many edge cases and malformed HTML documents for me to feel comfortable using regex or NSScanner. Does anyone have a solution to this?

One suggestion has been to simply look for opening and closing tag characters, but this method won't work except in very trivial cases.

For example, these cases (from the Perl Cookbook chapter on the same subject) would break this method:

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
A: 

I would imagine the safest way would just be to parse for <>s, no? Loop through the entire string, and copy anything not enclosed in <>s to a new string.
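
As a rough sketch of that loop (not a robust solution), something like the following uses NSScanner to copy everything that is not enclosed in < > into a new string. The function name NaiveStripTags is made up for illustration, and the code assumes every < starts a tag and the next > ends it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: copies everything outside <...> pairs into a new string.
// Assumption: every '<' opens a tag and the next '>' closes it.
static NSString *NaiveStripTags(NSString *html) {
    NSMutableString *result = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // NSScanner skips whitespace by default; turn that off so spacing is kept.
    [scanner setCharactersToBeSkipped:nil];

    while (![scanner isAtEnd]) {
        NSString *text = nil;
        // Collect text up to the next '<' (or up to the end of the string).
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }
        // Skip the tag: '<', everything up to '>', then the '>' itself.
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return result;
}
```

For simple input such as <b>Hi</b> there, this gives Hi there. For the IMG example in the question it leaves  B"> behind, because the > inside the ALT attribute is taken as the end of the tag. That is the kind of edge case the question is worried about.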

Ben Gottlieb
+5  A: 

Take a look at NSXMLParser. It's a SAX-style parser. You should be able to use it to detect tags or other unwanted elements in the XML document and ignore them, capturing only pure text.
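
A minimal sketch of that idea, assuming ARC and input that is well-formed XML: the delegate ignores element events and only appends the character data it is handed. TextCollector and TextFromXMLString are hypothetical names used for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: ignores element start/end events, keeps only text.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
// Called (possibly several times per text node) with character data.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical helper: returns the text content, or nil if parsing fails.
static NSString *TextFromXMLString(NSString *xml) {
    TextCollector *collector = [[TextCollector alloc] init];
    collector.text = [NSMutableString string];

    NSData *data = [xml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    parser.delegate = collector;

    // -parse returns NO when the input is not well-formed XML.
    BOOL ok = [parser parse];
    return ok ? [collector.text copy] : nil;
}
```

Keep in mind that NSXMLParser is an XML parser, not an HTML parser, so markup that isn't well-formed (an unclosed <br>, for instance) makes the parse fail rather than being skipped.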

Colin Barrett
A: 

Here's a blog post that discusses a couple of libraries available for stripping HTML: http://sugarmaplesoftware.com/25/strip-html-tags/ (note the comments, where other solutions are offered).

micco
This is the exact set of comments that I linked to in my question as an example of what would not work.
lfalin
A: 
Menschel
The question specifically asked about the iPhone. It is rather trivial to do this on the Mac, as you stated above.
Sam Soffes
+1  A: 

Hi,

If you want to get the content of a web page (HTML document) without the HTML tags, use this code inside the webViewDidFinishLoad: method of your UIWebView delegate.

NSString *myText = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.textContent"];

Biranchi
A: 

I would simply escape < and > by replacing them with &lt; and &gt;.
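
Roughly like this, as a sketch. EscapeAngleBrackets is a made-up name, and escaping & first is my own addition so that existing ampersands don't get mixed up with the new entities. Note that this keeps the tags visible as literal text instead of removing them.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: makes markup display as literal text.
static NSString *EscapeAngleBrackets(NSString *input) {
    // Assumption (not in the answer above): escape '&' first, so the
    // ampersands introduced by the next two steps are not escaped again.
    NSString *escaped = [input stringByReplacingOccurrencesOfString:@"&"
                                                         withString:@"&amp;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"<"
                                                 withString:@"&lt;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@">"
                                                 withString:@"&gt;"];
    return escaped;
}
```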

mouviciel
A: 

This post was really helpful if you've already parsed an XML document and don't want to parse the content again.

lostInTransit
+2  A: 

Hi,

There is a good answer to this question: Flatten HTML using Objective-C

vipintj
+1  A: 

If you are willing to use the Three20 framework, it has a category on NSString that adds a stringByRemovingHTMLTags method. See NSStringAdditions.h in the Three20Core subproject.

jarnoan