I have a regular expression that runs through HTML tags and grabs their values. I currently use this to grab everything inside the title tag:

<title\b[^>]*>(.*\s?)</title>

It works perfectly. So if I have a bunch of pages with these titles:

<title>Index</title>

<title>Artwork</title>

<title>Theory</title>

The values returned are: Index, Artwork, Theory

How can I make this regular expression ignore every title tag whose value contains Theory?

Thanks in advance.

A: 

A basic lookaround (here, a negative lookahead) would probably handle that.

<title\b[^>]*>(((?!Juju - Search Results).)*)(.*\s?)</title>
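
As a rough sketch of the same idea in Python (the function name and sample input are made up for illustration, and Python's re module is an assumption since the question does not name a regex engine): the lookahead is checked before every character, so a title containing the excluded text cannot match at all. Note that a trailing catch-all group such as (.*\s?) placed after the lookahead would let the excluded title match again, so this sketch leaves it out.

```python
import re


def build_title_pattern(excluded):
    """Compile a pattern that captures <title> text unless it contains `excluded`.

    Hypothetical helper for illustration only.
    """
    # (?:(?!X).)* consumes one character at a time, and only while the
    # excluded text X does not start at the current position.
    return re.compile(
        r"<title\b[^>]*>((?:(?!" + re.escape(excluded) + r").)*)</title>"
    )


# Sample input mirroring the titles from the question.
pages = (
    "<title>Index</title>\n"
    "<title>Artwork</title>\n"
    "<title>Theory</title>\n"
)

titles = build_title_pattern("Theory").findall(pages)
print(titles)  # ['Index', 'Artwork']
```

The same helper should work for a longer value such as "Juju - Search Results", since re.escape takes care of any characters that are special in a pattern.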
Snekse
That's a nice little program you have there but on my end there is no execute button to test.
Ricky
I've tested the above code and it still didn't work. Let's say, for instance, that instead of the value Theory I want to ignore the value "Juju - Search Results". The regex can even exclude values that begin with the first 4 words without even being concerned with spaces.
Ricky
Not sure I understand what you're getting at. I've updated the example with the regex that should handle the case you mentioned.
Snekse
Thanks bro perfect.
Ricky
I had to switch the parentheses around and remove the last set of brackets and * but works like a charm!!
Ricky
Just out of curiosity, can you post your final pattern?
Snekse
A: 

If your file input_file.txt contains:

<title>Index</title>

<title>Artwork</title>

<title>Theory</title>

Then the following command will drop the lines containing Theory and write the result to output_file.txt (input_file.txt itself is left unchanged):

sed '/Theory/d' input_file.txt > output_file.txt 
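
For comparison, a minimal Python sketch of the same line filter (the function name and the in-memory sample are hypothetical; it uses a plain substring test, which agrees with the sed pattern here only because Theory contains no regex metacharacters):

```python
def drop_lines_containing(lines, needle):
    """Return the lines that do not contain `needle` (plain substring test)."""
    return [line for line in lines if needle not in line]


# In-memory stand-in for the contents of input_file.txt.
sample = [
    "<title>Index</title>\n",
    "<title>Artwork</title>\n",
    "<title>Theory</title>\n",
]

kept = drop_lines_containing(sample, "Theory")
print("".join(kept), end="")
```

To work on real files, the list could be replaced by the lines read from input_file.txt and the result written to output_file.txt.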

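If you work in vim, :g/Theory/d will delete the lines containing Theory. The inverse, :g/\v^(.*Theory)@!/d, deletes the lines that do not contain Theory and keeps only the ones that do.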

djondal