ansaurus

Question

How to extract data from the following using RegEx?

Answer 1

+1 A:

Don't use regular expressions to parse HTML.

Use an HTML parser. There are a bunch listed on this page. Based on my experience using Tidy, I would suggest JTidy. From their page:

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

UPDATE

Based on the edit to your question, use split() to split the string with \([a-z]+\) as a delimiter. This should give you the separate components:

String[] components = str.split("\\([a-z]+\\)");

Or you could use the more generic expression \(.*?\).
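As a rough illustration of the split() suggestion, here is a minimal sketch run against the sample string from the comments below. The email address joe@example.com and the names TagSplitDemo and splitOnTags are placeholders invented for this example, not taken from the question.

```java
import java.util.Arrays;

class TagSplitDemo {
    // Splits on any parenthesised run of lowercase letters, e.g. "(abc)".
    static String[] splitOnTags(String str) {
        return str.split("\\([a-z]+\\)");
    }

    public static void main(String[] args) {
        // Hypothetical input modelled on the example in the comments.
        String sample = "1(abc)Joe(def)joe@example.com(xyz)";
        System.out.println(Arrays.toString(splitOnTags(sample)));
    }
}
```

Because split() drops trailing empty strings by default, the delimiter at the very end of the input does not add an extra element.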

Vivin Paliath 2010-10-22 19:51:39

This is no longer a well-formed HTML document.

Ragunath Jawahar 2010-10-22 19:53:25

@Ragunath, if it's not a well-formed document, you can still run it through Tidy to tidy it up, and then parse it.

Vivin Paliath 2010-10-22 19:54:14

@Vivin, OK, what would you do if the above dataset looked like **1(abc)Joe(def)[email protected](xyz)**? Forget HTML for a while.

Ragunath Jawahar 2010-10-22 19:56:53

You could tokenize it using `\\([a-z]+\\)` as a separator (for that exact example that you have provided).

Vivin Paliath 2010-10-22 20:00:27

@Vivin, I'll try. Thanks

Ragunath Jawahar 2010-10-22 20:01:51

Answer 2

+1 A:

Use this regex:

\(name\)(.*)\(email\)(.*)\(end\)

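Now the first capturing group (\1) contains the name, and the second capturing group (\2) contains the email address.

Keep applying the same regex to get the next name and email address. If several records share one string, make the quantifiers lazy, (.*?), because a greedy (.*) runs on to the last (email) and (end) in the string and merges the records into a single match.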
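A minimal sketch of applying that regex repeatedly in Java with Pattern and Matcher. It assumes several records sit back to back in one string, so it uses lazy (.*?) groups rather than greedy ones. The sample input, the addresses, and the names RecordExtractor and extract are made up for illustration, since the original data is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordExtractor {
    // Lazy groups so each match stops at the nearest (email) and (end).
    private static final Pattern RECORD =
            Pattern.compile("\\(name\\)(.*?)\\(email\\)(.*?)\\(end\\)");

    // Returns one {name, email} pair per record found in the input.
    static List<String[]> extract(String input) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) { // each find() resumes after the previous match
            records.add(new String[] { m.group(1), m.group(2) });
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical input: two records in the assumed tag format.
        String input = "(name)Joe(email)joe@example.com(end)"
                     + "(name)Ann(email)ann@example.com(end)";
        for (String[] r : extract(input)) {
            System.out.println(r[0] + " <" + r[1] + ">");
        }
    }
}
```

In Java the groups are read with group(1) and group(2) on the Matcher; the \1 and \2 notation applies inside a pattern or in other tools.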

Chetan 2010-10-22 20:02:51

I was looking for this. Thanks @Chetan

Ragunath Jawahar 2010-10-22 20:40:23

Answer 3

+1 A:

If you are guaranteed that this will be the standard pattern for all of your entries, you can simply use String.split() on each line, using the regular expression \(.*?\) as the split pattern. This matches a ( followed by the fewest possible other characters, followed by a ). So the code looks something like this:

// for each String line
String[] items = line.split("\\(.*?\\)");
String name = items[0];
String email = items[1];
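One caveat worth sketching: if a line begins with a delimiter, split() returns an empty string as the first element, so the name would land at index 1 rather than 0. The variant below filters out empty pieces. It assumes lines of the form (name)...(email)...(end); that format, the sample address, and the names LineFields and fields are assumptions for illustration only.

```java
import java.util.Arrays;

class LineFields {
    // Splits on any "(...)" tag and discards empty pieces, so a tag at the
    // start of the line does not shift the indexes.
    static String[] fields(String line) {
        return Arrays.stream(line.split("\\(.*?\\)"))
                     .filter(s -> !s.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Hypothetical line in the assumed format.
        String line = "(name)Joe(email)joe@example.com(end)";
        String[] items = fields(line);
        String name = items[0];
        String email = items[1];
        System.out.println(name + " " + email);
    }
}
```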

Zoe Gagnon 2010-10-22 20:02:56

Thanks for the answer @Zoe

Ragunath Jawahar 2010-10-22 20:40:47
