I have a data set in the following pattern:

1<a href="/contact/">Joe</a><br />[email protected]</div>
2<a href="/contact/">Tom</a><br />[email protected]</div>
3<a href="/contact/">Jerry</a><br />[email protected]</div>

And so on.

I need to extract only the name and email address from each line. How do I do that?


Update:

Based on your responses, I've changed my data format to:

1(name)Joe(email)[email protected](end)
2(name)Tom(email)[email protected](end)
3(name)Jerry(email)[email protected](end)

How do I parse that?

+1  A: 

Don't use regular expressions to parse HTML.

Use an HTML parser. There are a bunch listed on this page. Based on my experience using Tidy, I would suggest JTidy. From their page:

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

UPDATE

Based on the edit to your question, use split() to split the string with \([a-z]+\) as a delimiter. This should give you the separate components (the leading record number first, then the name, then the email address):

String[] components = str.split("\\([a-z]+\\)");

Or you could use the more generic expression \(.*?\).
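
As a minimal sketch of how the split() approach plays out on one record: the class name SplitDemo, the helper parseRecord, and the address joe@example.com are all made up for illustration (the real addresses are not shown in the question). Note that the text before the first delimiter, the record number, ends up at index 0, and the empty string after (end) is dropped by split().

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: splits one record on any "(lowercase-word)" marker.
    static String[] parseRecord(String line) {
        return line.split("\\([a-z]+\\)");
    }

    public static void main(String[] args) {
        // Sample record in the question's format; the address is a placeholder.
        String line = "1(name)Joe(email)joe@example.com(end)";
        String[] components = parseRecord(line);

        // components[0] is the record number, [1] the name, [2] the email.
        System.out.println(Arrays.toString(components));
        System.out.println("name  = " + components[1]);
        System.out.println("email = " + components[2]);
    }
}
```

This assumes every line has exactly the three markers in that order; a line missing one of them would yield a shorter array, so checking components.length before indexing is probably wise.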

Vivin Paliath
This is no longer a well-formed HTML document.
Ragunath Jawahar
@Ragunath. If it's not a well-formed document, you can still run it through Tidy to tidy it up, and then parse it.
Vivin Paliath
@Vivin, OK, what would you do if the above dataset looked like **1(abc)Joe(def)[email protected](xyz)**? Forget HTML for a while.
Ragunath Jawahar
You could tokenize it using `\\([a-z]+\\)` as a separator (for that exact example that you have provided).
Vivin Paliath
@Vivin, I'll try. Thanks
Ragunath Jawahar
+1  A: 

Use this regex:

\(name\)(.*)\(email\)(.*)\(end\)

After a match, the first capture group contains the name and the second capture group contains the email address. In Java these are read with matcher.group(1) and matcher.group(2).

Keep calling find() on the same matcher to get the next name and email address.
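
A rough sketch of that loop in Java, using java.util.regex. The class name RecordMatcher, the method extract, and the example.com addresses are invented for illustration. The sketch also swaps the greedy (.*) for the reluctant (.*?), which is a deliberate change from the regex above so that two records on the same line would not be merged into one match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordMatcher {
    // Same shape as the regex above, but with reluctant groups (an assumption
    // made here so that several records on one line still match separately).
    private static final Pattern RECORD =
            Pattern.compile("\\(name\\)(.*?)\\(email\\)(.*?)\\(end\\)");

    // Hypothetical helper: returns one {name, email} pair per record found.
    static List<String[]> extract(String data) {
        List<String[]> records = new ArrayList<>();
        Matcher matcher = RECORD.matcher(data);
        while (matcher.find()) {            // each call advances to the next record
            records.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return records;
    }

    public static void main(String[] args) {
        // Placeholder data in the question's format.
        String data = "1(name)Joe(email)joe@example.com(end)\n"
                    + "2(name)Tom(email)tom@example.com(end)";
        for (String[] record : extract(data)) {
            System.out.println(record[0] + " -> " + record[1]);
        }
    }
}
```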

Chetan
I was looking for this. Thanks @Chetan
Ragunath Jawahar
+1  A: 

If you are guaranteed that this will be the standard pattern for all of your entries, you can simply use String.split() on each line, using the regular expression \(.*?\) as the split pattern. This matches a ( followed by the fewest possible characters up to the next ). Because each line starts with a record number before the first marker, that number lands in items[0], and the name and email follow it. So the code looks something like this:

// for each String line
String[] items = line.split("\\(.*?\\)");
// items[0] is the leading record number
String name = items[1];
String email = items[2];
Zoe Gagnon
Thanks for the answer @Zoe
Ragunath Jawahar