Hi, problem:

How do I strip all attributes from the HTML tags in a string, except "alt" and "src", using Java?

Also, how do I get the values of all the "src" attributes in the string?

:)

+3  A: 

You can:

  • Implement a SAX parser;
  • Build a document with a DOM parser, walk it and prune it, and then convert it back to HTML; or
  • Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.
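As a rough illustration of the DOM option, here is a minimal sketch using only the JDK's built-in XML APIs. It assumes the input is already well-formed XHTML (for example after a pass through a tidying tool); the class name `AttributeStripper` and its method names are made up for this example. It removes every attribute except "alt" and "src" and collects the "src" values along the way.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes well-formed XHTML input.
public class AttributeStripper {
    // Attributes that survive the pruning step.
    private static final Set<String> KEEP = Set.of("alt", "src");

    // Parse a well-formed XHTML fragment with a single root element.
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // User-supplied input: refuse DOCTYPE declarations to avoid entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    // Walk the tree, drop unwanted attributes, and collect "src" values in document order.
    public static List<String> stripAttributes(Node node, List<String> srcValues) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            NamedNodeMap attrs = el.getAttributes();
            // Iterate backwards because the map shrinks as attributes are removed.
            for (int i = attrs.getLength() - 1; i >= 0; i--) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName();
                if (!KEEP.contains(name.toLowerCase(Locale.ROOT))) {
                    el.removeAttribute(name);
                } else if (name.equalsIgnoreCase("src")) {
                    srcValues.add(attr.getNodeValue());
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            stripAttributes(child, srcValues);
        }
        return srcValues;
    }

    // Convert the pruned document back to markup without an XML declaration.
    public static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p class=\"x\"><img src=\"a.png\" alt=\"A\" width=\"10\"/></p>");
        List<String> srcValues = stripAttributes(doc.getDocumentElement(), new ArrayList<>());
        System.out.println(serialize(doc));
        System.out.println(srcValues);
    }
}
```

Note that this only works if the parser accepts the input; arbitrary tag soup would need to be converted to well-formed markup first.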

Whatever you do, don't try to do it with regular expressions.

cletus
I've tried to use the DOM parser, but it requires the HTML to be well-formed, like an XML file. I'm using this on user-supplied input data, which could be in any format!
A: 

OK, solved this somehow.

I used the HTMLCleaner library to parse the input data into a well-formed document.

Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes.

(and some minor ugly hacks ;) )

This was kind of a lot of work.
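
For anyone attempting the same thing, the tag-stripping part of the DOM pass might look roughly like the sketch below. It is not the exact code used here: it assumes the input has already been cleaned into a well-formed document, the allowlist of tags is purely hypothetical, and it assumes that "stripping" a disallowed tag means removing the tag itself while keeping its children. The class name `TagFilter` is made up for the example, and only JDK classes are used.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the markup is already well-formed.
public class TagFilter {
    // Hypothetical allowlist; adjust to whatever tags should survive.
    private static final Set<String> ALLOWED_TAGS = Set.of("div", "p", "b", "i", "img");

    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Unwrap every disallowed descendant of 'node'. The node itself is not checked,
    // so the root element is always kept.
    public static void stripDisallowedTags(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before the tree changes
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripDisallowedTags(child); // clean the subtree first
                String tag = child.getNodeName().toLowerCase(Locale.ROOT);
                if (!ALLOWED_TAGS.contains(tag)) {
                    // Move the (already cleaned) children up, then drop the empty tag.
                    while (child.getFirstChild() != null) {
                        node.insertBefore(child.getFirstChild(), child);
                    }
                    node.removeChild(child);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<div><font color=\"red\">Hi <b>there</b></font><p>x</p></div>");
        stripDisallowedTags(doc.getDocumentElement());
        System.out.println(doc.getElementsByTagName("font").getLength());
    }
}
```

Keeping the children means the text inside a removed tag stays in the output; for tags such as script it would probably be safer to drop the whole subtree instead.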