Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","")

will work, but things like

&amp;

won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).

+3  A: 

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.
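To make the trade-off concrete, here is a minimal sketch of that second option. The class and method names (TagStripDemo, stripTags) are made up for illustration; only String.replaceAll and the regex itself come from the text above.

```java
final class TagStripDemo {
    // Removes anything that looks like <...>, using the regex from the answer.
    static String stripTags(String input) {
        return input.replaceAll("\\<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Well-formed markup: only the tags are removed.
        System.out.println(stripTags("<b>hey!</b>"));
        // Malformed markup: the text is swallowed along with the broken tag.
        System.out.println(stripTags("<bhey!</b>"));
        // Not HTML at all: everything between '<' and '>' is still removed.
        System.out.println(stripTags("a < b or b > c"));
    }
}
```

The last two calls show why a regex alone is fragile: it cannot tell a real tag from text that merely contains angle brackets.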

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you are likely to find. So even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you still need to encode any remaining HTML special characters to keep your output safe.
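As a rough illustration of that last step, the sketch below encodes the four characters mentioned earlier (ampersand, less-than, greater-than, double quote). HtmlEscapeSketch and escapeHtml are hypothetical names, and this is an assumption about what "encode" should cover here, not a complete escaping routine.

```java
final class HtmlEscapeSketch {
    // Encodes the characters that let text be interpreted as markup.
    // Run this on whatever is left after stripping tags.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, leftover input such as <bhey!</b> is displayed literally instead of being handed to the browser's parser.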

Chris Marasti-Georg
+8  A: 

HTML escaping is really hard to do right. I'd definitely suggest using library code for this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

Tim Howland
This is the sort of thing I'm looking for but I want to strip the HTML instead of escaping it.
Mason
Do you want to strip the HTML, or do you want to convert it to plain text? Stripping the HTML from a long string with br tags and HTML entities can result in an illegible mess.
Tim Howland
+2  A: 

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.

The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets is to check against a list of HTML tags. Something along these lines, where tag stands for each tag name in the list...

replaceAll("\\<[\\s]*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
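Put together, those steps might look like the sketch below. The tag list is deliberately short and hypothetical, only four entities are decoded, and KnownTagStripper/strip are names invented for this example, so treat it as an outline of the idea and not a finished implementation.

```java
import java.util.regex.Pattern;

final class KnownTagStripper {
    // Hypothetical, deliberately short list of tag names to strip.
    private static final String TAGS = "p|br|b|i|u|em|strong|a|div|span|ul|ol|li";

    // <br>, <br/> and </p> become newlines before anything is stripped.
    private static final Pattern BREAKS =
            Pattern.compile("(?i)<\\s*br\\s*/?\\s*>|<\\s*/\\s*p\\s*>");

    // Opening or closing tags whose name is in the list above.
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("(?i)<\\s*/?\\s*(?:" + TAGS + ")\\b[^>]*>");

    static String strip(String html) {
        String text = BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // Decode a few common entities; &amp; goes last so that
        // "&amp;lt;" ends up as "&lt;" and not "<".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }
}
```

Angle-bracketed text whose name is not in the list, such as <x>, is left alone, which is the point of checking against a list.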

foxy
+1  A: 

It sounds like you want to go from HTML to plain text. If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL. It makes use of org.htmlparser.beans.StringBean.

    static public String getUrlContentsAsText(String url) {
      String content = "";
      StringBean stringBean = new StringBean();
      stringBean.setURL(url);
      content = stringBean.getStrings();
      return content;
     }
+10  A: 

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 StringBuffer s;

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuffer();
   ParserDelegator delegator = new ParserDelegator();
   // the third parameter is TRUE to ignore charset directive
   delegator.parse(in, this, Boolean.TRUE);
 }

 public void handleText(char[] text, int pos) {
   s.append(text);
 }

 public String getText() {
   return s.toString();
 }

 public static void main (String[] args) {
   try {
     // the HTML to convert
     FileReader in = new FileReader("java-new.html");
     Html2Text parser = new Html2Text();
     parser.parse(in);
     in.close();
     System.out.println(parser.getText());
   }
   catch (Exception e) {
     e.printStackTrace();
   }
 }
}

ref : Remove HTML tags from a file to extract only the TEXT
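Since the question is about a Java string and not a file, here is a small variation that feeds the same parser from a StringReader. Html2TextFromString and convert are names made up for this sketch; ParserDelegator and HTMLEditorKit.ParserCallback are the same JDK classes used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class Html2TextFromString {
    // Collects the text nodes reported by the Swing HTML parser.
    static String convert(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try (StringReader in = new StringReader(html)) {
            // third parameter true: ignore any charset directive in the input
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling Html2TextFromString.convert("<p>Hello <b>world</b></p>") returns the text with the tags dropped; how whitespace between elements comes out is up to the parser.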

RealHowTo
The result of "a < b or b > c" is "a b or b > c", which seems unfortunate.
dfrankow
+1  A: 

Thanks RealHowTo! Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide. Cheers!

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback
{
  private static final Logger log = Logger.getLogger(Logger.GLOBAL_LOGGER_NAME);

  private StringBuffer stringBuffer;

  private Stack<IndexType> indentStack;

  public static class IndexType
  {
    public String type;
    public int counter; // used for ordered lists
    public IndexType(String type)
    {
      this.type = type;
      counter = 0;
    }
  }

  public HTML2Text()
  {
    stringBuffer = new StringBuffer(); 
    indentStack = new Stack<IndexType>();
  }

  public static String convert(String html)
  {
    HTML2Text parser = new HTML2Text();
    Reader in = new StringReader(html);
    try
    {
      // the HTML to convert
      parser.parse(in);
    }
    catch (Exception e)
    {
      log.severe(e.getMessage());
    }
    finally
    {
      try
      {
        in.close();
      }
      catch (IOException ioe)
      {
        // this should never happen
      }
    }
    return parser.getText();    
  }

  public void parse(Reader in) throws IOException
  {
    ParserDelegator delegator = new ParserDelegator();
    // the third parameter is TRUE to ignore charset directive
    delegator.parse(in, this, Boolean.TRUE);
  }

  public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
  {
    log.info("StartTag:" + t.toString());
    if (t.toString().equals("p"))
    {
      if (stringBuffer.length() > 0 && !stringBuffer.substring(stringBuffer.length() - 1).equals("\n"))
      {
        newLine();
      }
      newLine();
    }    
    else if (t.toString().equals("ol"))
    {
      indentStack.push(new IndexType("ol"));
      newLine();
    }    
    else if (t.toString().equals("ul"))
    {
      indentStack.push(new IndexType("ul"));
      newLine();
    }       
    else if (t.toString().equals("li"))
    {
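      // guard against an <li> that appears outside any list
      IndexType parent = indentStack.isEmpty() ? null : indentStack.peek();
      if (parent != null && parent.type.equals("ol"))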
      {
        String numberString = "" + (++parent.counter) + ".";
        stringBuffer.append(numberString);
        for (int i = 0; i < (4 - numberString.length()); i++)
        {
          stringBuffer.append(" ");
        }
      }
      else
      {
        stringBuffer.append("*   ");
      }
      indentStack.push(new IndexType("li"));
    }  
    else if (t.toString().equals("dl"))
    {
      newLine();
    }
    else if (t.toString().equals("dt"))
    {
      newLine();
    }       
    else if (t.toString().equals("dd"))
    {
      indentStack.push(new IndexType("dd"));
      newLine();
    }       
  }

  private void newLine()
  {
    stringBuffer.append("\n");
    for (int i = 0; i < indentStack.size(); i++)
    {
      stringBuffer.append("    ");
    }    
  }

  public void handleEndTag(HTML.Tag t, int pos)
  {
    log.info("EndTag:" + t.toString());
    if (t.toString().equals("p"))
    {
      newLine();
    }   
    else if (t.toString().equals("ol"))
    {
      indentStack.pop();
      newLine();
    }    
    else if (t.toString().equals("ul"))
    {
      indentStack.pop();
      newLine();
    }    
    else if (t.toString().equals("li"))
    {
      indentStack.pop();
      newLine();
    }     
    else if (t.toString().equals("dd"))
    {
      indentStack.pop();
    }      
  }

  public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
  {
    log.info("SimpleTag:" + t.toString());
    if (t.toString().equals("br"))
    {
      newLine();
    }
  }

  public void handleText(char[] text, int pos)
  {
    log.info("Text:" + new String(text));
    stringBuffer.append(text);
  }

  public String getText()
  {
    return stringBuffer.toString();
  }

  public static void main(String args[])
  {
    String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
    System.out.println(convert(html));    
  }  

}
Mike
+2  A: 

This is actually dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}
BalusC
Jsoup is nice, but I encountered some drawbacks with it. I use it to get rid of XSS, so basically I expect plain text input, but some evil person could try to send me some HTML. Using Jsoup, I can remove all HTML, but unfortunately it also collapses runs of spaces into one and removes line breaks (\n characters).
Ridcully
@Ridcully: for that you'd want to use [`Jsoup#clean()`](http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer) instead.
BalusC
A: 

One more way is to use the com.google.gdata.util.common.html.HtmlToText class, like this:

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries I get style info in the output as well. However, I believe it would be effective for small, simple jobs.

rjha94
A: 

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    // append only the reported slice of the buffer
    sb.append(ch, start, length);
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    // append only the reported slice, not the parser's whole buffer
    sb.append(ch, start, length);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}
dfrankow