I have a text like:

I've got a date with this fellow tomorrow. Well me and thousands of others. <br /><br /><img src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg&quot;&gt;&lt;br /><br />Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak. <br /><br />You all should come too!<br /><br /><a href="http://nh.barackobama.com/manchesterchange&quot;&gt;RSVP for the event</a>

I would like to clean it up to:

I've got a date with this fellow tomorrow. Well me and thousands of others http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak.You all should come too! h**p://nh.barackobama.com/manchesterchange RSVP for the event

I would like to write a Java program to do this. Any pointers/suggestions would be appreciated. The tags aren't limited to those in the post above; this was just an example.

Thanks!

PS: Replace *'s by t's in the second hyperlink as Stack Overflow doesn't allow me to post more than one link.

A: 

I would check out an HTML parser such as JTidy. Despite its name it will parse HTML and provide a useful API to allow you to extract what you need.

Brian Agnew
Hi Brian, thanks for the reply. The problem is that it isn't an HTML file. It's just a block of text. I am not sure whether JTidy/Jericho would help, as there are no tags like <body>, <table>, etc.
Denzil
JTidy was my initial idea, but what he actually wants is to get rid of all tags whatsoever.
Bozho
Bozho, I will try JTidy and reply. I am not sure whether JTidy is able to process a block of text (not well-formed HTML) and return a clean block of text.
Denzil
Thanks folks, got it solved.
Denzil
+1  A: 

JTidy will do what you want. I just tried it by saving the block of text in your post as test.txt and running JTidy with these options:

java -jar jtidy-r938.jar -asxml test.txt >test.html

It produced the following well-formed XHTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<title></title>
</head>
<body>
I've got a date with this fellow tomorrow. Well me and thousands of
others. <br />
<br />
<img
src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg" /><br />
<br />
Tomorrow morning I will be getting up at stupid o'clock and driving
up to Manchester, NH to see Barak Obama speak. <br />
<br />
You all should come too!<br />
<br />
<a href="http://nh.barackobama.com/manchesterchange">RSVP for the
event</a>
</body>
</html>

If you use the API instead of the command line, you will be able to extract the bits you are interested in and discard the rest.

Jason Day
Jason, Thanks for the reply. I will check JTidy and get back to you.
Denzil
A: 

The simplest way of 'tidying' text which has XML tags is to use a regular expression that identifies anything that is a tag (i.e. anything that starts with '<' and ends with '>' and everything in between). Note this works whether or not XML is 'well-formed' as it cleans up any tags regardless of whether opening tags match with closing tags.

For example,

String noXmlString = xmlString.replaceAll("\\<.*?\\>", "");

will remove all tags from a given string. The downside is that it won't preserve either the image link or the hyperlink from your example. Hope this helps though!
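One way around that downside is to pull the URL out of each <img> and <a> tag before stripping whatever tags remain. The following is only a minimal sketch, not part of the original answer: the class name LinkPreservingStripper and its strip method are hypothetical, and it assumes the tags are literal (not entity-encoded) and that src/href values are double-quoted.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name is illustrative): keeps link URLs, drops all tags.
public class LinkPreservingStripper {

    // Matches an <img ...> or <a ...> tag and captures a double-quoted src/href value.
    private static final Pattern LINK_TAG = Pattern.compile(
            "<(?:img|a)\\b[^>]*?\\b(?:src|href)\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Matches any remaining tag, opening or closing.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        // 1. Replace each link-bearing tag with its URL, padded with spaces.
        String withUrls = LINK_TAG.matcher(html).replaceAll(" $1 ");
        // 2. Replace every other tag with a space so adjacent words don't fuse.
        String noTags = ANY_TAG.matcher(withUrls).replaceAll(" ");
        // 3. Collapse runs of whitespace left behind by the replacements.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "Hello.<br /><br /><img src=\"http://example.com/pic.jpg\" />"
                + "<br />Come too!<br /><a href=\"http://example.com/rsvp\">RSVP</a>";
        System.out.println(strip(input));
    }
}
```

Replacing tags with a space rather than an empty string is a design choice: it avoids joining sentences across a <br />, at the cost of occasionally leaving a space before punctuation that directly follows a closing tag.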

Edited 11:58 04/04/10: Try this to remove HTML-encoded tags (i.e. anything that starts with &lt; and ends with &gt;)...

String noHtmlHtmlString = htmlHtmlString.replaceAll("&lt;.+?&gt;", "");

Then to remove any other HTML encoded/formatted bits like &quot; (i.e. anything that starts with & and ends with ; and in between conforms to a valid word without spaces or breaks) use

String noHtmlEncodingString = htmlEncodingString.replaceAll("&\\w+?;", "");

If there's any malformed HTML/XML beyond those, unless there's a known pattern there's no way of catching them.
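The three replacements above can be chained into one pass. This is a rough sketch only: the class name TagCleaner and its clean method are made up for illustration, and the regular expressions are the ones from this answer (with the redundant backslashes before < and > dropped).

```java
// Hypothetical wrapper (name is illustrative) around the three regexes above.
public class TagCleaner {

    public static String clean(String text) {
        // 1. Literal tags: anything from '<' to the next '>'.
        String s = text.replaceAll("<.*?>", "");
        // 2. Entity-encoded tags: anything from "&lt;" to the next "&gt;".
        s = s.replaceAll("&lt;.+?&gt;", "");
        // 3. Leftover entities such as &quot; or &amp;.
        s = s.replaceAll("&\\w+?;", "");
        return s;
    }

    public static void main(String[] args) {
        String input = "A&lt;br /&gt;B<br />C &quot;D&quot;";
        System.out.println(clean(input));
    }
}
```

The order matters: if step 3 ran before step 2, it would delete the &lt; and &gt; markers themselves and leave the tag's contents (e.g. "br /") in the text. Also note that . does not match line terminators by default, so a tag that spans several lines would survive; prefixing the patterns with (?s) enables DOTALL mode if that is a concern.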

David Johnson
Hi David, the "javax.swing.text.html.HTMLEditorKit" may not work because my file is not an HTML file. It's NOT well-formed HTML, or XML for that matter. It's a text file which contains a block of text, and unfortunately the text has HTML tags like the ones mentioned in the comments above to Peter Lang.
Denzil
&lt;br /&gt; seems to be an HTML-encoded version of <br />, so it's HTML for rendering HTML to show a user! I can add another regex to my solution for you to remove those.
David Johnson