How do I find out whether a string contains HTML or not? The user provides input via a rich:editor component, and it's quite possible that they entered either plain text or HTML formatting.

+2  A: 

You can use regular expressions to search for HTML tags.

Tom Gullen
Ah, good old problem #2. Tom is correct, regex is the most direct way to get the job done, and there are usually lots of examples online to help you get going.
Alex Larzelere
@Alex Larzelere: problem #2? Can you explain? Is this an xkcd reference ("now you've got two problems"), or something else?
CPerkins
@Cperkins that's it exactly. Ol' problem #2, problem #1 of course is whatever you were trying to do originally.
Alex Larzelere
+2  A: 

In your backing bean, you can try to find HTML tags such as <b> or <i>. You can use regular expressions (slower) or just look for the '<' and '>' characters. It depends on how sure you want to be that the user actually used HTML.

Keep in mind that the user could write <asdf>. If you want to be 100% sure that the HTML used is valid, you will need a full HTML parser from some library (TidyHTML maybe?)
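
To make the trade-off concrete, here is a minimal sketch of both checks. The class and method names (HtmlSniffer, hasAngleBrackets, hasTagLikeText) are made up for illustration, and the pattern is only one reasonable guess at what "looks like a tag" means; it is not a validator.

```java
import java.util.regex.Pattern;

final class HtmlSniffer {
    // Something shaped like a tag: '<', optional '/', a letter, then anything up to '>'.
    private static final Pattern TAG_LIKE = Pattern.compile("</?[A-Za-z][^<>]*>");

    // Cheapest check: are both angle brackets present anywhere in the string?
    static boolean hasAngleBrackets(String s) {
        return s.indexOf('<') >= 0 && s.indexOf('>') >= 0;
    }

    // Stricter check: is there at least one tag-shaped substring?
    static boolean hasTagLikeText(String s) {
        return TAG_LIKE.matcher(s).find();
    }
}
```

The bracket check flags plain text such as "1 < 2 and 3 > 2", while the pattern check does not. Both still accept <asdf>, which is why a parser is needed if validity matters.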

pakore
A: 

Regular expressions are the tool to use here. They help you find potential HTML tags. You can then check whether the text inside the brackets is an actual HTML tag name. If it is, show an alert telling the user not to use HTML, or simply strip the tag if you prefer.
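
One way this could look, as a rough sketch: pull out each tag-shaped match and compare its name against a list of known tags. The class name KnownTagCheck and the tag list are assumptions for illustration; the list is deliberately short and would need to cover whatever the editor can actually produce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KnownTagCheck {
    // Hypothetical, incomplete list of tag names; extend as needed.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "a", "b", "i", "u", "p", "br", "div", "span", "strong", "em", "ul", "ol", "li"));

    // Group 1 captures the name of an opening or closing tag.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^<>]*>");

    // True if at least one tag-shaped match carries a known tag name.
    static boolean containsKnownTag(String s) {
        Matcher m = TAG.matcher(s);
        while (m.find()) {
            if (KNOWN_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

With this list, <asdf> is ignored while <B>, </p> and <br/> are reported.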

1s2a3n4j5e6e7v
+1  A: 

If you don't want the user to have HTML in their input, you can replace all '<' characters with their HTML entity equivalent, '&lt;', and all '>' characters with '&gt;'.
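
A minimal sketch of that replacement, assuming a hypothetical helper class HtmlEscaper. It also escapes '&', which the answer above does not mention; without that, text the user typed as a literal entity would be rendered as the character it names rather than as typed.

```java
final class HtmlEscaper {
    // Replaces the characters that start tags or entities with their HTML entities.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // addition, not in the answer above
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

This is only meant for text placed in element content. Quotes are left alone, so it is not sufficient for values written into HTML attributes.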

Tom Gullen
+2  A: 

What would you like to do with this information? Show a validation error to the user that s/he shouldn't enter HTML? For that regex may suffice. Basically you just need to check if the string contains the pattern <sometag...>

// (?s) lets '.' match line breaks too; without it, multi-line input never matches.
boolean containsHTML = value.matches("(?s).*<[^>]+>.*");

Or do you want to remove all HTML? Then regex is unsuitable, since it can't reliably remove or replace real HTML. Better to use an HTML parser like Jsoup. It's then as easy as:

String text = Jsoup.parse(value).text();

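If necessary, you can also compare afterwards to see whether it contained HTML (note that text() also normalizes whitespace and decodes entities, so this can report true for plain text containing line breaks, repeated spaces, or entities):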

boolean containedHTML = !text.equals(value);

Alternatively, if your whole concern was XSS, then you can also just ignore all of this and redisplay the text in escaped form. E.g. the < will be escaped as &lt; and the > as &gt;, so they get displayed as-is in the final HTML. The JSF <h:outputText> component already does that by default:

<h:outputText value="#{bean.value}" />
BalusC