I am trying to replace two or more consecutive <br/> tags (for example <br/><br/><br/>) with exactly two (<br/><br/>), using the following pattern:

Pattern brTagPattern = Pattern.compile("(<\\s*br\\s*/\\s*>\\s*){2,}", 
     Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

But in some cases the tags are separated by a space, as in '<br/> <br/>', and they end up replaced with four <br/> tags when they were supposed to be replaced with just two.

What can I do to ignore the two or three (a few) spaces that come between the tags?
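For reference, here is a minimal Java sketch of how the pattern above might be applied. The class name `BrCollapseDemo` and the helper `collapse` are hypothetical and not part of the original post; the sketch assumes the replacement is done with `Matcher.replaceAll`.

```java
import java.util.regex.Pattern;

public class BrCollapseDemo {
    // The pattern from the question: two or more <br/> tags,
    // each optionally followed by whitespace.
    static final Pattern BR_TAG_PATTERN = Pattern.compile(
            "(<\\s*br\\s*/\\s*>\\s*){2,}",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: replaces every matched run with exactly two tags.
    static String collapse(String html) {
        return BR_TAG_PATTERN.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        System.out.println(collapse("<br/> <br/>"));
        System.out.println(collapse("a<br/>  <br/>\n<br/>b"));
    }
}
```

Running inputs like these through the helper is a quick way to check whether the pattern itself, or some other step in the pipeline, produces the extra tags.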

+1  A: 

Here's some Groovy code to test your Pattern:

import java.util.regex.*

Pattern brTagPattern = Pattern.compile( "(<\\s*br\\s*/\\s*>\\s*){2,}", Pattern.CASE_INSENSITIVE | Pattern.DOTALL )
def testData = [
  ['',                            ''],
  ['<br/>',                       '<br/>'],
  ['< br/> <br />',               '<br/><br/>'],
  ['<br/> <br/><br/>',            '<br/><br/>'],
  ['<br/>   < br/ > <br/>',       '<br/><br/>'],
  ['<br/> <br/>   <br/>',         '<br/><br/>'],
  ['<br/><br/><br/> <br/><br/>',  '<br/><br/>'],
  ['<br/><br/><br/><b>w</b><br/>','<br/><br/><b>w</b><br/>'],
 ]

testData.each { inputStr, expected ->
  Matcher matcher = brTagPattern.matcher( inputStr )
  assert expected == matcher.replaceAll( '<br/><br/>' )
}

And everything seems to pass fine...

tim_yates
@tim_yates: Thanks, buddy. It was just an issue raised by one of my colleagues, and I thought it was a valid one. I guess something else was causing it.
Arun Abraham
Your code won't work with `<br/><br/><br/> hello`: it returns `<br/><br/>hello` instead of `<br/><br/> hello`. The question asks to ignore *only* the spaces between the <br/> tags.
greuze
+1  A: 

Probably not the answer you want to hear, but it is general wisdom that you should not attempt to parse XML/HTML with regular expressions. So many things can go wrong -- it's a much better idea to use a parsing library specifically meant for such data, which will also completely bypass the issue you're having.

Take a look at JAXB if you are certain your HTML is well-formed XML; if the HTML is likely to be messy and non-compliant (like most real-world HTML), you should try something like TagSoup.

Adrian Petrescu
+1, and requisite link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Kirk Woll
+1  A: 

You can do that by changing your regex a little:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>\\s*<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

This will ignore any whitespace between two <br/> tags. If you want to allow exactly two or three whitespace characters between them, you can use:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>(\\s){2,3}<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
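Note that a pattern matching exactly two tags would, with `replaceAll`, leave a run of three tags unchanged (the first pair is rewritten and the third tag stays). As a sketch of one possible variant, not something from the original question, the pattern below matches a first tag followed by one or more further tags, each with optional leading whitespace, so whitespace after the last tag is never consumed. The class name `BrCollapseKeepTrailing` and the helper `collapse` are hypothetical.

```java
import java.util.regex.Pattern;

public class BrCollapseKeepTrailing {
    // One <br/> tag, then one or more additional tags, each preceded by
    // optional whitespace. Nothing after the final tag is part of the match.
    static final Pattern BR_RUN = Pattern.compile(
            "<\\s*br\\s*/\\s*>(?:\\s*<\\s*br\\s*/\\s*>)+",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collapses each run of two or more tags to two.
    static String collapse(String html) {
        return BR_RUN.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        // The space before "hello" is kept.
        System.out.println(collapse("<br/><br/><br/> hello"));
        // Whitespace between tags is dropped.
        System.out.println(collapse("<br/> <br/>   <br/>"));
    }
}
```

The design choice here is to attach the optional whitespace to the front of each repeated tag rather than to the end, which is what keeps the text following the run untouched.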
greuze