This is my string:

<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';

I want to produce this output:

Some data,More data

Right now, I do this in PHP to filter out the data:

$rePlaats = "#<br/>([^<]*)<br/>[^<]*<br/>';#";
$aPlaats = array();
preg_match($rePlaats, $lnURL, $aPlaats);    // $lnURL is the source string
$evnPlaats = $aPlaats[1];

This would work if it weren't for these <span> tags, as shown here:

<br/>Some data,More data<br/>(more data)<br/>';

I will have to rewrite the regex to tolerate HTML tags (except for <br/>) and strip out the <span> tags with the strip_tags() function. How can I do a "does not contain" operation in regex?

A: 

As has been said many times, don't use regex to parse HTML. Use the DOM instead.
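For illustration, here is a minimal sketch of what a DOM-based approach might look like with PHP's bundled DOMDocument and DOMXPath classes. The helper name textBetweenFirstTwoBreaks, the wrapping div with id "root", and the cleaned-up sample string (real quotes, no trailing PHP residue) are assumptions made for this example, not something from the question.

```php
<?php
// Hypothetical helper: collect the text that sits between the first and
// second <br> elements, ignoring any inline tags such as <span>.
function textBetweenFirstTwoBreaks(string $html): string
{
    // Silence libxml warnings about fragment-style or sloppy markup.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so it has a single, easy-to-find parent element.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="root"]')->item(0);

    $breaks = 0;
    $text = '';
    foreach ($root->childNodes as $node) {
        if ($node->nodeName === 'br') {
            $breaks++;
            if ($breaks === 2) {
                break; // stop at the second <br>
            }
            continue;
        }
        if ($breaks === 1) {
            // textContent flattens <span>...</span> to its inner text.
            $text .= $node->textContent;
        }
    }
    return $text;
}

$html = '<br/><span style=\'background:yellow\'>Some data</span>,'
      . '<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';
$result = textBetweenFirstTwoBreaks($html);
print $result . "\n";
```

Whether this is simpler than a regex is a judgment call; it does avoid having to describe what a tag looks like.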

RMcLeod
Use the DOM in PHP? You're making me curious and skeptical at the same time
LorenVS
This should be a comment: it doesn't answer the question about negation in regular expressions. People who arrive here from Google by searching for the text of the title will find nothing.
Kamarey
@Kamarey: They find that one should not parse HTML with regex, which seems to be the community consensus: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Tomalak
Way to go in not answering the question. And getting up-voted for a popularist statement (regardless of validity). How about you provide a code example and explain how that is superior to the attempt made in the question.
PP
@Tomalak: I have no problem with this "community consensus", but this question is about what is written in its title, and the title says nothing about HTML. HTML is just the specific case here. So either the title should be changed to mention HTML, or answers not directly related to the question (title) should at least be comments.
Kamarey
-1 for not answering the question and repeating a popular but untrue statement. Many HTML documents will give you an incomprehensible tree if you try the DOM. In this particular case the OP does not care whether his text is a child of a <span> element or not. All he needs to know is how the pieces are separated by breaks. That's many lines of code untangling tree branches where a few simple regexes would do.
yu_sha
A: 

If you really want to use regular expressions for this, then you're better off using regex replaces. This regex should match tags; I just whipped it up off the top of my head, so it might not be perfect:

</?[a-zA-Z0-9]{0,20}(\s+[a-zA-Z0-9]{0,20}=(("[^"]*")|('[^']*'))){0,20}\s*[/]{0,1}>

Once all the tags are gone, the rest of the string manipulation should be pretty easy.
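As a rough sketch of the replace idea, a negative lookahead is probably the closest thing regex has to "does not contain": it lets the replace remove every tag except <br/>. The pattern below is simpler than the one above and is only an assumption about what would suit this input. The helper name stripTagsExceptBr is made up for the example, and the sample string is written in single quotes so that \' becomes a real quote.

```php
<?php
// Hypothetical helper: remove every tag except <br>, <br/> and <br />.
function stripTagsExceptBr(string $html): string
{
    // (?!br\s*/?>) is a negative lookahead: right after "<" the text must
    // NOT be "br" followed by optional whitespace, an optional "/" and ">".
    return preg_replace('#<(?!br\s*/?>)[^>]*>#i', '', $html);
}

// Sample input modelled on the question, including the trailing ';
$lnURL = '<br/><span style=\'background:yellow\'>Some data</span>,'
       . '<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>\';';

$clean = stripTagsExceptBr($lnURL);

// With the <span> tags gone, the regex from the question works unchanged.
$rePlaats = "#<br/>([^<]*)<br/>[^<]*<br/>';#";
$aPlaats = array();
preg_match($rePlaats, $clean, $aPlaats);
$evnPlaats = $aPlaats[1];
print $evnPlaats . "\n";
```

The lookahead only inspects the text after "<" without consuming it, so <br/> is left alone while <span ...> and </span> are removed.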

LorenVS
+2  A: 

Don't listen to these DOM purists. If you parse HTML with the DOM, you'll end up with an incomprehensible tree. It's perfectly OK to parse HTML with regex, if you know what you are after.

Step 1) Replace <br */?> with {break}

Step 2) Replace <[^>]*> with empty string

Step 3) Replace {break} with <br>
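In PHP the three steps could be sketched roughly like this. The helper name keepOnlyBreaks is made up for the example, and it assumes the input never contains the literal text {break}.

```php
<?php
// Hypothetical helper implementing the three steps above.
function keepOnlyBreaks(string $html): string
{
    // Step 1: protect every <br>, <br/> or <br /> with a placeholder.
    $s = preg_replace('#<br */?>#i', '{break}', $html);
    // Step 2: drop every remaining tag.
    $s = preg_replace('#<[^>]*>#', '', $s);
    // Step 3: turn the placeholders back into <br>.
    return str_replace('{break}', '<br>', $s);
}

$html = '<br/><span style=\'background:yellow\'>Some data</span>,'
      . '<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';

$flat = keepOnlyBreaks($html);

// The wanted text is the piece between the first and second <br>.
$parts = explode('<br>', $flat);
$wanted = $parts[1];
print $wanted . "\n";
```

After step 3 every break is a plain <br>, so splitting on it is enough and no further regex is needed.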

yu_sha
Good answer. And if you want to allow for angle brackets within tag attributes, use `<(?:"(?:\\"|[^"])*"|[^>])*>` in step 2.
Tim Pietzcker
+1  A: 

Don't fret over too much regex. Use your normal PHP string functions:

$str = "<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';";
$s = explode("</span>",$str);
for($i=0;$i<count($s)-1;$i++){
    print preg_replace("/.*>/","",$s[$i]) ."\n"; #minimal regex
}

Explode on "</span>", since the data you want always sits right before a "</span>". Then go through each element of the array and remove everything from the start up to the last ">". That leaves your data. The last element is excluded because it only holds the text after the final "</span>".

output

$ php test.php
Some data
More data
ghostdog74