I'm trying to view the text of HTML files in a reasonable way. After I remove all of the markup and retain only the visible text, I obtain a String that looks something like this:

\n\n\n\n \n\n\n \n\n \n Title here \n\n\n \n\n \n\n Menu Item 1 \n\n \n\n Menu Item 2 \n\n\n \n\n you get the       point.

I would like to use String.replaceAll(String regex, String replacement) to replace any whitespace substring that contains more than two occurrences of \n with "\n\n".

Any ideas?

*Edit:*

Sorry for the lack of precision. I would like the above text changed to:

\n\nTitle here\n\nMenu Item 1\n\nMenu Item 2\n\nyou get the       point.

I want any substring that is only whitespace and contains more than two newlines to be replaced by "\n\n".

+5  A: 
str.replaceAll("\\s*\n\\s*\n\\s*\n\\s*", "\n\n")

This replaces any whitespace-only substring that contains more than two \n characters with \n\n.

The Java regex reference I always use is located here. It should help you build regular expressions in the future.
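For anyone who wants to try the pattern above on the sample text from the question, here is a minimal sketch. The class and method names (`CollapseBlankLines`, `collapse`) are made up for illustration; only `String.replaceAll` and the regex come from the answer. Note that `replaceAll` returns a new String, so the result has to be kept.

```java
public class CollapseBlankLines {
    // A whitespace-only run that contains at least three newlines.
    private static final String THREE_OR_MORE_NEWLINES = "\\s*\n\\s*\n\\s*\n\\s*";

    // Hypothetical helper: collapses each such run to exactly two newlines.
    static String collapse(String text) {
        // Strings are immutable, so the result must be returned (or reassigned).
        return text.replaceAll(THREE_OR_MORE_NEWLINES, "\n\n");
    }

    public static void main(String[] args) {
        String raw = "\n\n\n\n \n\n\n \n\n \n Title here \n\n\n \n\n \n\n Menu Item 1 "
                + "\n\n \n\n Menu Item 2 \n\n\n \n\n you get the       point.";
        // Show newlines as a visible \n so the result fits on one line.
        System.out.println(collapse(raw).replace("\n", "\\n"));
        // prints: \n\nTitle here\n\nMenu Item 1\n\nMenu Item 2\n\nyou get the       point.
    }
}
```

Runs of spaces that contain no newline (like the gap before "point.") are left alone, because the pattern requires three \n characters to match.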

jjnguy
Thanks much. You are missing one backslash in the middle (should be `\\s` of course) but this is what I wanted.
FarmBoy
@FarmBoy, thanks for the catch. Glad I could help.
jjnguy
+1  A: 

Another option:

  str.replaceAll("[ \\t\\r]*\n", "\n").replaceAll("\n{3,}", "\n\n");

This is a little less efficient (two replaces) but much cleaner to me: it is easy to understand and modify. The first replace is useful in many cases (and might already be part of your earlier cleaning); it makes sure that each line has no trailing blanks and that it ends with a plain \n terminator. The second one clearly expresses your goal.
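A rough sketch of the two-pass idea as a standalone helper, in case it is useful. The names (`TwoPassCleanup`, `tidy`) are hypothetical, and the first pattern is an assumption on my part: it only matches text that ends in a real \n, so it can never match an empty string and never inserts a newline that was not already there.

```java
public class TwoPassCleanup {
    // Hypothetical helper illustrating the two-pass approach.
    static String tidy(String text) {
        return text
                // Pass 1: drop spaces, tabs and \r sitting directly before a \n,
                // so every terminated line ends in a plain \n with no trailing blanks.
                .replaceAll("[ \\t\\r]*\n", "\n")
                // Pass 2: collapse three or more consecutive newlines into two.
                .replaceAll("\n{3,}", "\n\n");
    }

    public static void main(String[] args) {
        String raw = "Title here \n\n\n \n\n \n\nMenu Item 1\r\nMenu Item 2";
        // Show newlines as a visible \n so the result fits on one line.
        System.out.println(tidy(raw).replace("\n", "\\n"));
        // prints: Title here\n\nMenu Item 1\nMenu Item 2
    }
}
```

Unlike the single-regex answer, this leaves leading blanks on a line untouched, so text that is indented after a blank line keeps its indentation; a third replace would be needed if those should go too.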

leonbloy