I have the following HTML snippet:

        <br>
        Date: 2010-06-20,  1:37AM PDT<br>
        <br>
        Daddy: <a href="...">www.google.com</a>
        <br>

I want to extract

Date: 2010-06-20, 1:37AM PDT

and

Daddy: <a href="...">www.google.com</a>

using a Java regex.

What regex should I use?

+1  A: 

This should give you a nice starting point:

    String text = 
    "        <br>\n" +
    "        Date: 2010-06-20,  1:37AM PDT<br>   \n" +
    "   <br>    \n" +
    "Daddy: <a href=\"...\">www.google.com</a>   \n" +
    "<br>";

    // Split on runs of one or more <br> tags, swallowing any whitespace around them.
    String[] parts = text.split("(?:\\s*<br>\\s*)+");
    for (String part : parts) {
        System.out.println("[" + part + "]");
    }

This prints (as seen on ideone.com):

[]
[Date: 2010-06-20,  1:37AM PDT]
[Daddy: <a href="...">www.google.com</a>]

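This uses String.split(String regex), which returns a String[]. The regex pattern means "one or more <br> tags, each with optional leading or trailing whitespace". Because the input itself begins with a match, the first element of the result is an empty string.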
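
If the empty first element is unwanted and you would rather stay within the JDK, one option is to filter it out after splitting. The following is only a minimal sketch: the class name BrSplitSketch and the helper splitOnBr are made up for illustration, and it assumes the tag always appears exactly as <br> (lowercase, no attributes, no self-closing slash).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
final class BrSplitSketch {

    // Splits on runs of <br> tags (with surrounding whitespace) and
    // drops any empty pieces, such as the one produced when the input
    // starts with a <br>.
    static List<String> splitOnBr(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("(?:\\s*<br>\\s*)+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text =
            "        <br>\n" +
            "        Date: 2010-06-20,  1:37AM PDT<br>   \n" +
            "   <br>    \n" +
            "Daddy: <a href=\"...\">www.google.com</a>   \n" +
            "<br>";

        for (String part : splitOnBr(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

With the sample text above, this should print only the Date and Daddy lines, without the leading [] entry.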


Guava alternative

You can also use Splitter from Guava. It is more readable, and it can drop the empty leading string for you via omitEmptyStrings().

    Splitter splitter = Splitter.on("<br>").trimResults().omitEmptyStrings();
    for (String part : splitter.split(text)) {
        System.out.println("[" + part + "]");
    }

This prints:

[Date: 2010-06-20,  1:37AM PDT]
[Daddy: <a href="...">www.google.com</a>]

polygenelubricants
Also maybe you want something like this? http://www.rubular.com/r/wy3b1ABsaC Leave a comment and I'll elaborate on any of these approaches.
polygenelubricants
Also check this one out: http://www.rubular.com/r/mftjWgKWzP Tell me which one you fancy.
polygenelubricants