I have text similar to this:

<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>

and I need to extract this text using a regexp:

this is <b>the</b> text

The problem is, when I use a simple regexp like this (<html>.*</p>) I get the whole text up to the last occurrence of </p>.

Can anyone help me?

thanks lennyd

+3  A: 

You need a non-greedy match:

<html>.*?</p>
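To see the difference on the sample text from the question, here is a minimal sketch. Python's re module is my assumption (the question doesn't name a language), and the <p>(.*?)</p> variant with a capture group is my own adaptation to pull out just the paragraph contents:

```python
import re

html = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST </p>.
greedy = re.search(r"<html>.*</p>", html).group(0)

# Non-greedy: .*? grabs as little as possible, so the match stops at the FIRST </p>.
lazy = re.search(r"<html>.*?</p>", html).group(0)

# Assumed variant: capture only what sits between the first <p> and </p>.
inner = re.search(r"<p>(.*?)</p>", html).group(1)

print(greedy)
print(lazy)
print(inner)
```

If a paragraph can span several lines, you would probably also need the re.DOTALL flag so that . matches newlines.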

Also, you might want to consider using an HTML parser instead of regular expressions for this task.
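For comparison, a rough sketch of the parser route, assuming Python and its standard html.parser module. The class FirstParagraph and the helper first_paragraph are hypothetical names made up for this illustration; the sketch keeps inline tags such as <b> and ignores nested or unclosed paragraphs:

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collects the markup inside the first <p>...</p> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside the first <p>
        self.done = False    # True once the first </p> has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.inside:
            self.inside = True
        elif self.inside:
            # Keep inline tags such as <b> exactly as they were written.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.inside:
            return
        if tag == "p":
            self.inside = False
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"
print(first_paragraph(html))
```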

Mark Byers
Nice to see a regex answer to an HTML question :-) You know, you should really be using an HTML parser for this instead.
Mike Sherov
Cool, it's working, thanks for the help (I can't use an HTML parser in this case, otherwise I would).
lennyd
@Mike: Yeah, my reputation is ruined now! ;-)
Mark Byers
+2  A: 
Dominik
A: 

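To capture the data in between para tags you may use a regexp with a positive look-ahead assertion, /<p>(.*)(?=<\/p>)/, which is more greedy than .*? and works slower, but may be helpful for you. Note that because .* is greedy here, with several paragraphs it captures everything up to the last </p>; write (.*?) in the group if you want it to stop at the first one. Also make sure that your HTML is valid, which means: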

  1. All para tags are closed. Browsers close an open para tag implicitly when they encounter another block element, but a regex will not.
  2. Para tags are not nested :) Otherwise you will have problems with any regex.
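A quick sketch of how the look-ahead version behaves on the text from the question, assuming Python's re module (in a Python raw string the / does not need escaping). The lazy variant is my own adaptation for comparison:

```python
import re

html = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy group with look-ahead: .* backtracks from the end of the string
# until it is followed by </p>, so it stops just before the LAST </p>.
greedy_lookahead = re.search(r"<p>(.*)(?=</p>)", html).group(1)

# Assumed lazy variant: .*? expands one character at a time and stops
# just before the FIRST </p>.
lazy_lookahead = re.search(r"<p>(.*?)(?=</p>)", html).group(1)

print(greedy_lookahead)
print(lazy_lookahead)
```

The look-ahead only checks that </p> follows without consuming it, so the closing tag is not part of the overall match in either case.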
dma_k
A: 

Silly question, but still using pure regex: why not just strip any <..> inside the paragraphs first, and THEN grab the phrases using something like [^<]?
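One way this two-step idea could look, as a sketch only and assuming Python's re module. Both patterns, including the * quantifier on [^<], are my guesses at what is meant here. Note that the result loses the <b> markup the question wanted to keep:

```python
import re

html = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Step 1 (assumed pattern): strip every tag except <p> and </p>.
# The negative look-ahead (?!/?p\b) skips the paragraph tags themselves.
only_paragraph_tags = re.sub(r"<(?!/?p\b)[^>]*>", "", html)

# Step 2: with no other tags left, [^<]* is enough to grab each paragraph's text.
phrases = re.findall(r"<p>([^<]*)</p>", only_paragraph_tags)

print(only_paragraph_tags)
print(phrases)
```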

Luxvero