ansaurus

Question

How do we create such a regular expression to extract data?

Answer 1

+6 A:

If you learn one thing on SO, let it be: "Do not parse HTML with a regex." Use an HTML parser.
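As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library `html.parser`. The sample input and the class name `BrTextExtractor` are assumptions made for the example; the question's actual data is not shown in this thread.

```python
from html.parser import HTMLParser


class BrTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            self.chunks.append(data)


# Assumed sample input, not taken from the original question.
parser = BrTextExtractor()
parser.feed("<br>first<br><br>second item<br>")
parser.close()

chunks = parser.chunks
print(chunks)
```

The parser deals with the tags itself, so the extraction code only ever sees the text runs.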

RC 2009-11-19 15:22:48

To anyone pointing automatically to this one, quoting from the very same blog post: "Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing." It's absolutely OK to parse an HTML-like input if you keep this in mind.

candiru 2009-11-19 15:28:51

This question is missing the obligatory bobince reference: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

intgr 2009-11-19 15:29:17

@candiru: The asker explicitly asked for a regexp that is **"safe"**. Regexps are fine for one-off hacks, but they are certainly not safe.

intgr 2009-11-19 15:30:39

@intgr: It's linked from Jeff's post, which I linked in a comment on the question. It's just another pointer to dereference :-)

Joey 2009-11-19 15:44:21

Answer 2

A:

Split the string at `(<br>)+`. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.

If you want to preserve the `<br>`, then this is not possible unless you know that there is one before and after each element in the result.
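A minimal sketch of the split approach in Python, assuming the separator is a literal `<br>` and using made-up sample input. Note that `re.split` also returns the text of capturing groups, so this sketch uses a non-capturing group `(?:<br>)+` to keep the separators out of the result.

```python
import re

# Assumed sample input, not taken from the original question.
text = "<br>first<br><br>second item<br>"

# Split on runs of one or more <br> tags. The split leaves empty strings
# at the beginning and end of the result, so filter them out.
parts = [p for p in re.split(r"(?:<br>)+", text) if p]
print(parts)  # ['first', 'second item']
```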

Aaron Digulla 2009-11-19 15:24:02

Sorry, I misread the question.

Aaron Digulla 2009-11-19 15:45:49

You can still prepend and append a `<br>` to each result, though. Not nice, but if the OP *requires* the `<br>` ...

Joey 2009-11-19 16:03:48

Answer 3

+1 A:

<br>.*?<br>

will match anything from one `<br>` tag to the closest following one.

The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
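To see why the non-greedy quantifier matters, here is a small Python sketch comparing `.*?` with `.*` on made-up sample input (the question's real data is not shown here).

```python
import re

# Assumed sample input, not taken from the original question.
text = "<br>first<br><br>second item<br>"

# Non-greedy: each match stops at the closest following <br>.
lazy = re.findall(r"<br>.*?<br>", text)

# Greedy: a single match runs from the first <br> to the last one.
greedy = re.findall(r"<br>.*<br>", text)

print(lazy)    # ['<br>first<br>', '<br>second item<br>']
print(greedy)  # ['<br>first<br><br>second item<br>']
```

Each match consumes its closing `<br>`, so an item that shares a single `<br>` with the previous item would be skipped by this pattern.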

Tim Pietzcker 2009-11-19 18:53:55

You are right, I need a non-greedy match.

bobo 2009-11-20 01:47:05
