ansaurus

Question

How do I convert HTML text to plain text?

Answer 1

+2 A:

Use an HTML parser such as HtmlCleaner.

For a detailed answer, see: http://stackoverflow.com/questions/1699313/how-to-remove-html-tag-in-java
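
HtmlCleaner is a third-party library, so as a rough illustration of the parser-based approach here is a sketch that uses only the HTML parser bundled with the JDK (javax.swing.text.html). It is a stand-in, not HtmlCleaner's API. The class name ParserTextExtractor and the method toPlainText are made up for this example, and joining text runs with a single space is an assumption you will probably want to tune.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text runs reported by the JDK's HTML parser.
final class ParserTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called for each run of text between tags.
        // Assumption: separating runs with one space is good enough here.
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    static String toPlainText(String html) {
        ParserTextExtractor callback = new ParserTextExtractor();
        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(
            "<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Because the parser does the tokenizing, a stray > inside an attribute value or a tag split across lines is less likely to trip this up than a regex. Whitespace in the result is crude, so normalise it afterwards if it matters.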

Ankit Jain 2010-08-31 10:06:09

Answer 2

+1 A:

I'd recommend running the raw HTML through JTidy, which should give you output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.

Jon Freedman 2010-08-31 10:07:22

Answer 3

A:

Just getting rid of HTML tags is simple:

// replace each run of one or more HTML tags (with optional
// whitespace in between) with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

But unfortunately the requirements are never that simple:

Usually, <p> and <div> elements need separate handling, and there may be CDATA blocks containing > characters (e.g. JavaScript) that break the regex, and so on.
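
To make those caveats concrete, here is a minimal regex-only sketch that handles a few of them. It rests on several assumptions: <script> and <style> elements, comments and CDATA sections are dropped entirely, <p>, <div> and <br> become line breaks, and every other tag becomes a space. The names HtmlStripper and toPlainText are made up for this example. It does not decode entities such as &amp;, and it is still no substitute for a real parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: regex-based HTML-to-text with a little block handling.
final class HtmlStripper {
    // Whole <script>/<style> elements, comments and CDATA sections.
    // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
    private static final Pattern NON_TEXT = Pattern.compile(
        "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>|<!--.*?-->|<!\\[CDATA\\[.*?\\]\\]>");

    // Opening or closing <p>, <div> and <br> tags (but not <pre>, <b>, ...).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|br)\\b[^>]*>");

    // Anything else that looks like a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String toPlainText(String html) {
        String s = NON_TEXT.matcher(html).replaceAll("");   // drop non-text content first
        s = BLOCK_BREAK.matcher(s).replaceAll("\n");        // block tags become line breaks
        s = ANY_TAG.matcher(s).replaceAll(" ");             // remaining tags become a space
        s = s.replaceAll("[ \\t]+", " ");                   // collapse runs of spaces and tabs
        s = s.replaceAll(" ?\\n[ \\n]*", "\n");             // collapse blank lines
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(
            "<div>Hello <b>world</b></div>"
            + "<script>if (a > b) { x(); }</script>"
            + "<p>Second</p>"));
    }
}
```

For the input in main this prints "Hello world" on one line and "Second" on the next. The > inside the script is harmless only because the whole script element is removed before the generic tag pattern runs, so the order of the steps matters.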

seanizer 2010-08-31 10:58:45

good that you clarified all that complexity!

Ankit Jain 2010-08-31 13:18:32

Answer 4

A:

Getting the Text in an HTML Document

camickr 2010-08-31 15:31:19

Answer 5

A:

You can use this single line to remove the HTML tags and display the result as plain text:

htmlString = htmlString.replaceAll("\\<.*?\\>", "");
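
One thing worth checking with this one-liner: in Java regexes '.' does not match line breaks by default, so a tag that is split across two lines is left in place. The small sketch below shows the difference. The class and method names are made up for the example, and the (?s) variant is a suggested tweak, not part of the line above.

```java
// Hypothetical demo of how the one-liner treats a tag that spans two lines.
final class OneLinerCheck {
    // The one-liner as given: '.' stops at line breaks.
    static String strip(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    // Same idea with the DOTALL flag (?s), so '.' also matches line breaks.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"x\">link</a>";
        System.out.println(strip(html));       // the opening tag survives
        System.out.println(stripDotAll(html)); // prints "link"
    }
}
```

Neither variant copes with a > inside a script or an attribute value, so this is only suitable for simple, well-behaved markup.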

Kandhasamy 2010-09-03 10:16:40
