You basically need to replace each <br> with \n and each <p> with \n\n. So, at the points where you succeed in removing them, you need to insert the \n and \n\n respectively.
Here's a kickoff example with the help of the Jsoup HTML parser (the HTML example is intentionally written so that it's hard, if not nearly impossible, to use regex for this).
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class Br2Nl {

        public static void main(String[] args) throws Exception {
            String text = br2nl("<p>p1l1<br/><!--</p>-->p1l2<br><!--<p>--></br><p id=p>p2l1<br class=b>p2l2</p>");
            System.out.println(text);
            System.out.println("-------------");
            String html = nl2br(text);
            System.out.println(html);
        }

        public static String br2nl(String html) {
            Document document = Jsoup.parse(html);
            // Insert a literal backslash-n marker, because text() normalizes real whitespace.
            document.select("br").append("\\n");
            document.select("p").prepend("\\n\\n");
            // Turn the literal markers into real newlines.
            return document.text().replaceAll("\\\\n", "\n");
        }

        public static String nl2br(String text) {
            return text.replaceAll("\n\n", "<p>").replaceAll("\n", "<br>");
        }
    }
Output:
p1l1
p1l2
p2l1
p2l2
-------------
<p>p1l1<br>p1l2<br> <p>p2l1<br>p2l2
A bit hacky, but it works. Jonathan Hedley, the guy behind Jsoup, has however planned a Formatter which should make this stuff easier.
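If the unclosed <p> tags and unescaped text produced by the simple nl2br bother you, here is a minimal sketch of a stricter variant. It uses only the standard library, not Jsoup. The class name Nl2brSketch and the helper escape are hypothetical names made up for this illustration. It assumes paragraphs are separated by a blank line (optionally containing whitespace) and that the input is plain text that should be HTML-escaped.

```java
final class Nl2brSketch {

    // Hypothetical helper, not part of Jsoup: wraps each paragraph in <p>...</p>
    // and turns the remaining single newlines into <br>.
    static String nl2br(String text) {
        StringBuilder html = new StringBuilder();
        // A paragraph break is a newline, optional whitespace, then another newline.
        for (String paragraph : text.trim().split("\\n\\s*\\n")) {
            if (paragraph.isEmpty()) {
                continue; // nothing left after trimming, e.g. for empty input
            }
            html.append("<p>")
                .append(escape(paragraph).replace("\n", "<br>"))
                .append("</p>");
        }
        return html.toString();
    }

    // Escape the characters that would otherwise be parsed as markup.
    // The ampersand must be replaced first so the other entities stay intact.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(nl2br("p1l1\np1l2\n\np2l1\np2l2"));
    }
}
```

Unlike the one-liner above, this closes every paragraph, so the result can be fed back into a parser without relying on tag-soup recovery.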