ansaurus

Question

How do I write a regular expression in Java that takes into account the context of the string I'm looking for?

Answer 1

+3 A:

This is a common question on SO and the answer is always the same: regular expressions are a poor and limited tool for parsing HTML because HTML is not a regular language.

You should be using an HTML parser, for example HTML Parser.

If you're curious what I mean by "regular language", have a look at JMD, Markdown and a Brief Overview of Parsing and Compilers. Basically, a regular expression is equivalent to a DFA (deterministic finite automaton, also called a deterministic finite state machine). HTML requires a PDA (pushdown automaton) to parse. A PDA is essentially a finite automaton with a stack, and that stack is how it handles recursive (nested) elements.
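To make the stack point concrete, here is a minimal sketch (not a real HTML parser) comparing a regex with a hand-rolled depth counter on nested tags. The class name NestingDemo, both method names, and the bare <div> input format are assumptions made up for illustration; the depth counter stands in for a one-symbol stack.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingDemo {
    private static final String OPEN = "<div>";
    private static final String CLOSE = "</div>";

    // Regex attempt: reluctant match from the first <div> to the nearest </div>.
    // It has no notion of depth, so it stops at the inner closing tag.
    static String regexInner(String html) {
        Matcher m = Pattern.compile("<div>(.*?)</div>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Stack-style attempt: count nesting depth and return the content of the
    // outermost <div>, i.e. up to the closing tag that brings depth back to 0.
    static String balancedInner(String html) {
        int start = html.indexOf(OPEN);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        int i = start;
        while (i < html.length()) {
            if (html.startsWith(OPEN, i)) {
                depth++;                      // push
                i += OPEN.length();
            } else if (html.startsWith(CLOSE, i)) {
                depth--;                      // pop
                if (depth == 0) {
                    return html.substring(start + OPEN.length(), i);
                }
                i += CLOSE.length();
            } else {
                i++;
            }
        }
        return null; // unbalanced input
    }

    public static void main(String[] args) {
        String html = "<div>a<div>b</div>c</div>";
        System.out.println(regexInner(html));    // a<div>b
        System.out.println(balancedInner(html)); // a<div>b</div>c
    }
}
```

Making the regex greedy would fix this particular input but break on two sibling elements, which is why a real parser is the safer choice.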

cletus 2010-01-19 04:26:09

Thank you, this is very helpful. So is the search function, which I have failed to use lately ;)

2010-01-19 04:28:24

Answer 2

A:

htm">(.+)</a></td>

This matches one or more characters (that's the .+ bit) between htm"> and </a></td>. The parentheses around .+ form a capturing group, so after Pattern.matcher() and Matcher.find() you can retrieve the text in between with Matcher.group(1).
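A rough sketch of how that pattern might be used with java.util.regex follows. The class name LinkTextExtractor, the helper firstGroup, and the sample row are made up for illustration. The RELUCTANT variant is my own assumption, not part of the pattern above: it is there because the greedy .+ runs to the last </a></td> on a line that contains several links.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    // The pattern from this answer: greedy .+ extends to the LAST </a></td>.
    static final Pattern GREEDY = Pattern.compile("htm\">(.+)</a></td>");
    // Assumed variant: reluctant .+? stops at the FIRST </a></td>.
    static final Pattern RELUCTANT = Pattern.compile("htm\">(.+?)</a></td>");

    // Returns the text captured by group 1 of the first match, or null if none.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "<td><a href=\"a.htm\">First</a></td>"
                   + "<td><a href=\"b.htm\">Second</a></td>";
        System.out.println(firstGroup(GREEDY, row));    // First</a></td><td><a href="b.htm">Second
        System.out.println(firstGroup(RELUCTANT, row)); // First
    }
}
```

To collect every link rather than the first, call m.find() in a loop and read m.group(1) on each iteration.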

http://www.regular-expressions.info/java.html

Cetra 2010-01-19 04:39:21
