Hello, I want to parse some HTML and create objects from the text in a table. The table has several columns, and I want to save the content of certain columns from every row. I have the HTML, and I understand I should use Pattern and Matcher to get those strings, but I don't know how to write the required regular expression.

This is a row I will be parsing:

<tr><td><a href="delirium.htm">Delirium</a></td><td>65...</tr>

So, I want to extract Delirium from that string. How do I write a regular expression that says

get me the string that is between the string htm"> and </a></td>

?

+3  A: 

This is a common question on SO and the answer is always the same: regular expressions are a poor and limited tool for parsing HTML because HTML is not a regular language.

You should be using an HTML parser, for example HTML Parser.

If you're curious what I mean by "regular language", have a look at JMD, Markdown and a Brief Overview of Parsing and Compilers. Basically, a regular expression is equivalent to a DFA (deterministic finite automaton, also called a deterministic finite state machine). HTML requires a PDA (pushdown automaton) to parse. A PDA is essentially a finite automaton with a stack, and the stack is how it handles recursive (nested) elements.
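To make the "automaton with a stack" point concrete, here is a rough sketch of a nesting check. It is a toy, not an HTML parser: the class name TagNesting and the method isBalanced are made up for illustration, and it ignores void elements such as <br>, comments, and anything else real HTML allows. The regex only finds individual tags; the stack is what tracks arbitrarily deep nesting.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNesting {
    // Finds one tag at a time: group(1) is "/" for a closing tag, group(2) is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Returns true if every opening tag is closed in the correct (nested) order.
    // Hypothetical helper: treats every tag as a paired element.
    static boolean isBalanced(String markup) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                open.push(name);            // opening tag: remember it
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;               // closing tag does not match the innermost open tag
            }
        }
        return open.isEmpty();              // anything left on the stack was never closed
    }
}
```

The stack can grow without bound, which is exactly the memory a finite automaton does not have.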

cletus
Thank you, this is very helpful. So is the search function, which I have failed to use lately ;)
A: 
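htm">(.+?)</a></td>

This matches one or more characters (that's the .+? bit) between htm"> and </a></td>. The parentheses around .+? form a capturing group, so after calling Pattern.matcher() and Matcher.find() you can read the text in between with Matcher.group(1). The ? makes the match reluctant, so it stops at the first </a></td> rather than running on to the last one in the input.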
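For completeness, a minimal sketch of how this might be wired up with Pattern and Matcher. The class name LinkTextExtractor and the method extract are only illustrative. Note that it uses the reluctant form .+? so that each match stops at the first </a></td>; with a plain .+ the match would run to the last </a></td> in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextExtractor {
    // Group 1 captures the text between htm"> and the next </a></td>.
    private static final Pattern LINK_TEXT = Pattern.compile("htm\">(.+?)</a></td>");

    // Returns the captured text of every match found in the given HTML.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "<tr><td><a href=\"delirium.htm\">Delirium</a></td><td>65...</tr>";
        System.out.println(extract(row)); // prints [Delirium]
    }
}
```

This only works as long as the markup keeps exactly this shape; any change in attributes or whitespace between the tags will break the match.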

http://www.regular-expressions.info/java.html

Cetra