




I'm trying to select all of the text that follows a specific pattern:

Sample Text:

"by thatonekid (Posted Mon Jan 12, 2009 7:17 pm)
fell onto the trail right below one of the most traveled walls at the point! yikes!"


Every text I work on will start with: "by USERNAME (Posted DATE) <br /> theTextIWant"

I thought about exploding on the parentheses, but that could obviously break up the text if it contains another parenthesis.

Secondly, some of the texts end in "<br /><br />". I need to remove the trailing <br /> tags if there is no text after them.

I apologize if this looks like I'm asking for someone to do my homework -- I honestly have no idea where to begin here.


For instance, you could try these regexps with preg_match, I guess. See the online docs.

username: [_a-zA-Z0-9]+
date: [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}

(Sorry, gotta go; I'll help more later if this is still unsolved.)
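
To make the preg_match idea concrete, here is a minimal sketch that captures the username, the date and the text in one call. The function name parse_post() and the exact header pattern are assumptions for illustration, based only on the "by USERNAME (Posted DATE) <br /> text" shape described in the question; real posts may need a looser pattern.

```php
<?php
// Hypothetical helper (not from the thread): split a post into its parts.
function parse_post(string $raw): ?array
{
    // Assumed header shape: by USERNAME (Posted DATE) <br /> body
    // ([^)]+) takes the date as "anything up to the first closing paren",
    // so parentheses later in the body do not interfere.
    $pattern = '/^by\s+(\S+)\s+\(Posted\s+([^)]+)\)\s*<br\s*\/?>\s*(.*)$/is';

    if (preg_match($pattern, $raw, $m) !== 1) {
        return null; // header not recognised
    }

    return ['user' => $m[1], 'date' => $m[2], 'body' => $m[3]];
}

$post = parse_post("by thatonekid (Posted Mon Jan 12, 2009 7:17 pm)<br />fell onto the trail");
echo $post['user'], "\n"; // thatonekid
echo $post['body'], "\n"; // fell onto the trail
```

Because the date is matched as "everything up to the first closing parenthesis", this avoids the explode-on-parentheses problem mentioned in the question.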

+3  A: 

If you only want the text after the username/date, you can simply remove everything before the first <br />, assuming your formatting is consistent.

$text = preg_replace("/^.*?<br(\s\/)?>/si", "", $string);

This replaces everything up to and including the first <br> or <br /> (case-insensitively) with an empty string, leaving you with just the text. The .*? at the beginning is a non-greedy match, meaning it will match as little as possible. In other words, it won't grab past the first break.
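
A small sketch of the difference between the non-greedy and greedy forms may help here. The sample string is made up for illustration; only the quantifier differs between the two calls.

```php
<?php
// Made-up sample with a <br /> inside the text itself.
$string = "by someone (Posted some date)<br />first line<br />second line";

// Non-greedy .*? stops at the first <br />, so only the header is removed.
$lazy = preg_replace('/^.*?<br(\s\/)?>/si', '', $string);

// Greedy .* runs to the last <br />, so part of the text is lost as well.
$greedy = preg_replace('/^.*<br(\s\/)?>/si', '', $string);

echo $lazy, "\n";   // first line<br />second line
echo $greedy, "\n"; // second line
```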

You can then follow this with:

$text = preg_replace("/(?:\s*<br(?:\s\/)?>)+\s*$/si", "", $text);

This should remove all ending whitespace and <br>/<br /> tags.

You could also do all of this with a single preg_replace:

$text = preg_replace("/^.*?<br(?:\s\/)?>(.*?)(?:<br(?:\s\/)?>\s*)*$/si", "$1", $string);

I made all of the () captures (?:) non-captures, except the one containing the text.
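
For reference, the whole approach could be wrapped up roughly like this. This is only a sketch: the function name extract_post_text() is hypothetical, the patterns are slightly loosened to also accept <br/>, and it assumes the header always ends at the first break tag.

```php
<?php
// Hypothetical helper (not from the thread): return only the post text.
function extract_post_text(string $raw): string
{
    // Step 1: drop everything up to and including the first <br>, <br/> or <br />.
    // If no break tag is present, the string is left unchanged.
    $text = preg_replace('/^.*?<br\s*\/?>/si', '', $raw, 1);

    // Step 2: drop any run of trailing break tags and the whitespace around them.
    // Only non-capturing groups (?:...) are used, since nothing needs to be kept.
    $text = preg_replace('/(?:\s*<br\s*\/?>)+\s*$/i', '', $text);

    return trim($text);
}

echo extract_post_text("by a (Posted x)<br />hello (world)<br /><br />"), "\n"; // hello (world)
echo extract_post_text("by a (Posted x) <br /> no trailing tags"), "\n";        // no trailing tags
```

Doing it in two steps means texts without any trailing break tags still have their header removed, and break tags in the middle of the text are left alone.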

(I don't use PHP regularly, so I am assuming that a Perl-compatible regex is what it says it is.)

Jeff B
Thanks Jeff! The first replace returns an empty string. Any ideas? The third one returns the following error: Message: preg_replace(): Compilation failed: unrecognized character after (? or (?- at offset 8
What about adding a capture after the first BR, then simply returning the capture? Any idea how that would work?
@jmccartie: OK, I fixed the problems in the first two. I had left out the 's' modifier, which lets . match newlines so the pattern can run across the entire multi-line string. Also, I added the ^ beginning-of-string match. As for the third one, the correct non-capturing syntax is (?:) not (?), so I fixed that, but for some reason it still isn't working for me.
Jeff B
@jmccartie: I figured out the third as well. Should work now.
Jeff B
Thanks, Jeff!!!