ansaurus

Question

How can I do a "does not contain" operation in regex?

Answer 1

A:

As has been said many times don't use regex to parse html. Use the DOM instead.
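For illustration only (this is not part of the original answer): a minimal sketch of the parser-based route, written in Python rather than PHP. It uses the standard library's event-based `html.parser` instead of a full DOM tree, and the sample markup, the class name `TextCollector`, and the helper `text_lines` are all assumptions made up for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text content, starting a new chunk at every <br> tag."""

    def __init__(self):
        super().__init__()
        self.lines = [[]]

    def handle_starttag(self, tag, attrs):
        # <br/> is reported here too: the default handle_startendtag
        # forwards self-closing tags to handle_starttag.
        if tag == "br":
            self.lines.append([])

    def handle_data(self, data):
        self.lines[-1].append(data)


def text_lines(html):
    """Return the text between line breaks, with all tags dropped."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ["".join(parts) for parts in parser.lines if parts]


# Hypothetical sample input
sample = ("<br/><span style='background:yellow'>Some data</span>,"
          "<span style='background:yellow'>More data</span><br/>(more data)<br/>")
print(text_lines(sample))  # ['Some data,More data', '(more data)']
```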

RMcLeod 2009-11-25 11:23:31

Use the DOM in PHP? You're making me curious and skeptical at the same time

LorenVS 2009-11-25 11:29:18

This should be a comment: it doesn't address negation in regular expressions, so people who arrive here from Google after searching for the text in the title will find nothing.

Kamarey 2009-11-25 11:30:40

@Kamarey: They find that one should not parse HTML with regex, which seems to be the community consensus: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Tomalak 2009-11-25 11:42:06

Way to go in not answering the question, and getting up-voted for a populist statement (regardless of validity). How about you provide a code example and explain how it is superior to the attempt made in the question?

PP 2009-11-25 11:43:10

@Tomalak: I have no problem with this "community consensus", but the question is about what its title says, and the title doesn't mention HTML. HTML is just the specific problem here. So either the title should be changed to mention HTML, or answers that don't address the question in the title should at least be comments.

Kamarey 2009-11-25 11:50:56

-1 for not answering the question and for repeating a popular but untrue statement. Many HTML documents will give you an incomprehensible tree if you try the DOM. In this particular case the OP does not care whether his text is a child of an element or not. All he needs to know is how the pieces are separated by breaks. That's many lines of code untangling tree branches where a few simple regexes would do.

yu_sha 2009-11-25 11:55:28

Answer 2

A:

If you really want to use regular expressions for this, then you're better off using regex replaces. This regex SHOULD match tags; I just whipped it up off the top of my head, so it might not be perfect:

<[a-zA-Z0-9]{0,20}(\s+[a-zA-Z0-9]{0,20}=(("[^"]*")|('[^']*'))){0,20}\s*[/]{0,1}>

Once all the tags are gone, the rest of the string manipulation should be pretty easy.
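A rough sketch of that replace, in Python rather than PHP. The pattern here is a variant of the one above, not the original: it assumes attribute values of any length and also accepts a leading slash so that closing tags are removed too. The sample string and the name `strip_tags` are made up for the example.

```python
import re

# Variant of the tag pattern: optional leading "/", a tag name, up to 20
# quoted attributes, optional trailing "/" before the closing ">".
TAG = re.compile(
    r"""</?[a-zA-Z0-9]{1,20}(?:\s+[a-zA-Z0-9]{1,20}=(?:"[^"]*"|'[^']*')){0,20}\s*/?>"""
)


def strip_tags(html):
    """Remove everything the TAG pattern recognises as a tag."""
    return TAG.sub("", html)


# Hypothetical sample input
sample = ("<br/><span style='background:yellow'>Some data</span>,"
          "<span style='background:yellow'>More data</span><br/>(more data)<br/>")
print(strip_tags(sample))  # Some data,More data(more data)
```

Note that the line breaks disappear along with every other tag, so anything that depends on them has to be handled before this step.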

LorenVS 2009-11-25 11:38:37

Answer 3

+2 A:

Don't listen to these DOM purists. Parse HTML with the DOM and you'll end up with an incomprehensible tree. It's perfectly OK to parse HTML with regex if you know what you are after.

Step 1) Replace <br/> with {break}

Step 2) Replace <[^>]*> with empty string

Step 3) Replace {break} with <br/>
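The three steps can be sketched roughly like this, in Python rather than PHP, assuming the token protected in steps 1 and 3 is a `<br/>` line break. The function name `strip_tags_keep_breaks` and the sample string are made up for the example.

```python
import re


def strip_tags_keep_breaks(html, placeholder="{break}"):
    # Step 1: protect line breaks behind a placeholder
    # (assumes the placeholder never occurs in the real text).
    text = re.sub(r"<br\s*/?>", placeholder, html, flags=re.IGNORECASE)
    # Step 2: drop every remaining tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Step 3: put the line breaks back.
    return text.replace(placeholder, "<br/>")


# Hypothetical sample input
sample = ("<br/><span style='background:yellow'>Some data</span>,"
          "<span style='background:yellow'>More data</span><br/>(more data)<br/>")
print(strip_tags_keep_breaks(sample))
# <br/>Some data,More data<br/>(more data)<br/>
```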

yu_sha 2009-11-25 11:52:38

Good answer. And if you want to allow for angle brackets within tag attributes, use `<(?:"(?:\\"|[^"])*"|[^>])*>` in step 2.
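A small Python check of the difference, with a made-up input whose attribute value contains a `>`; the variable names are only for illustration:

```python
import re

SIMPLE_TAG = re.compile(r"<[^>]*>")
# Quoted attribute values are consumed as a unit, so a ">" inside them
# does not end the tag early.
QUOTE_AWARE_TAG = re.compile(r'<(?:"(?:\\"|[^"])*"|[^>])*>')

# Hypothetical sample input
html = '<a title="x > y" href="#">link</a>'

naive = SIMPLE_TAG.sub("", html)
careful = QUOTE_AWARE_TAG.sub("", html)

print(repr(naive))    # ' y" href="#">link'
print(repr(careful))  # 'link'
```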

Tim Pietzcker 2009-11-25 12:15:55

Answer 4

+1 A:

Don't trouble yourself with too much regex. Use the normal PHP string functions:

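<?php
// Single-quoted string, so each \' is a literal single quote.
$str = '<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';
$s = explode("</span>", $str);  // each chunk except the last ends with the wanted text
$n = count($s) - 1;             // skip the last chunk: it comes after the final </span>
for ($i = 0; $i < $n; $i++) {
    print preg_replace("/.*>/", "", $s[$i]) . "\n"; // minimal regex: drop everything up to the last ">"
}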

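Explode on "</span>", since the data you want always sits right before a "</span>". Then go through every element of the array and remove everything from the start up to the last ">". This leaves your data. The last element is excluded, because it holds whatever follows the final "</span>".

Output: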

$ php test.php
Some data
More data

ghostdog74 2009-11-25 11:52:55
