I am trying to delete part of a string that does not match my pattern. For example, in

<SYNC Start=364><P Class=KRCC>
<Font Color=lightpink>abcd

I would like to delete

<P Class=KRCC><Font Color=lightpink>

How do I do that?

+3  A: 

Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.
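As a rough illustration of the plain search-and-replace route, here is a minimal Python sketch. It assumes the sample from the question, where the two tags sit on separate lines, so each fragment is removed on its own; the variable names are just placeholders.

```python
# Sample input from the question; note the line break between the two tags.
text = "<SYNC Start=364><P Class=KRCC>\n<Font Color=lightpink>abcd"

# Remove each fixed substring with plain str.replace; no regex involved.
for fragment in ("<P Class=KRCC>", "<Font Color=lightpink>"):
    text = text.replace(fragment, "")

print(repr(text))
```

This only works when the unwanted text is always spelled exactly the same way (same case, same attribute order, same spacing).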

Tomalak
I agree. If you can use a string-replace function, you will get a performance benefit too.
Stuart
... and if you want to remove HTML nodes, use an HTML parser.
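A minimal sketch of the parser route, assuming Python and its standard-library `html.parser` module. The `DropTags` class and the choice of tags to drop are hypothetical, picked to fit the sample in the question; it re-emits everything except the listed tags and does not try to handle self-closing tags or re-escape entities.

```python
from html.parser import HTMLParser


class DropTags(HTMLParser):
    """Re-emit markup, leaving out the listed tags (their text content is kept)."""

    def __init__(self, drop):
        super().__init__()
        # HTMLParser reports tag names in lowercase, so normalise the list too.
        self.drop = {name.lower() for name in drop}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.drop:
            # get_starttag_text() returns the start tag as written in the input.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.drop:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


parser = DropTags(["p", "font"])
parser.feed("<SYNC Start=364><P Class=KRCC>\n<Font Color=lightpink>abcd")
parser.close()
result = "".join(parser.out)
print(repr(result))
```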
Svante
+1  A: 

Just match your pattern and write the matches to a file or to a database table. That way, you are effectively deleting the rest.

Alan Haggai Alavi
+1  A: 

If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.

These samples will have to be modified based on the tool/framework you're using to handle regular expressions. I am not escaping special characters for brevity.

To match any paragraph tags:

<p.*?>(.*?)</p>

You would replace these matches with $1 (or whatever your syntax requires to access groups).
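For instance, in Python (an assumption; the question does not name a language) the group reference is written `\1` rather than `$1`. A minimal sketch with made-up sample text:

```python
import re

html = "<p>Lorem ipsum.</p><p>Dolor sit amet.</p>"

# Non-greedy quantifiers: each <p ...>...</p> pair is matched on its own,
# and the whole match is replaced by the captured inner text.
unwrapped = re.sub(r"<p.*?>(.*?)</p>", r"\1", html)

print(unwrapped)
```

Note that `.` does not match newlines by default in Python; pass `flags=re.DOTALL` if a paragraph can span several lines.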

It's important to use non-greedy quantifiers (the ? after .*) to avoid accidentally matching across two unrelated start/end tags. For example, the greedy version:

<p.*>(.*)</p>

Would behave very differently. In the case of the following example HTML, it would not correctly match two paragraphs:

<p>Lorem ipsum.</p><p>Lorem ipsum.</p>

Instead, the opening <p.*> part would match "<p>Lorem ipsum.</p><p>" as if it were a single start tag, so only the last paragraph's text would be captured and the first paragraph's content would be lost.
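The difference is easy to see side by side. A small sketch, again assuming Python, with two different paragraph texts so it is clear which one survives:

```python
import re

html = "<p>Lorem ipsum.</p><p>Dolor sit amet.</p>"

greedy = re.search(r"<p.*>(.*)</p>", html)    # greedy quantifiers
lazy = re.search(r"<p.*?>(.*?)</p>", html)    # non-greedy quantifiers

# Greedy: one match spanning the whole string; only the last
# paragraph's text ends up in group 1.
print(repr(greedy.group(0)), repr(greedy.group(1)))

# Non-greedy: the first match stops at the first closing tag.
print(repr(lazy.group(0)), repr(lazy.group(1)))
```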

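If you need to match paragraphs with specific classes, you could use something like this (with [^>]* rather than .*? inside the tag, so the match cannot start at an earlier <p> and run into this one):

<p[^>]*class="delete"[^>]*>(.*?)</p>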
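One thing to watch with class matching: a `.*?` between `<p` and `class` is allowed to run past the end of one tag and into the next. The sketch below (Python assumed, sample text made up) compares a loose pattern with a stricter variant that uses `[^>]*` inside the tag; the names `loose` and `strict` are only for illustration.

```python
import re

html = '<p>keep me</p><p class="delete">unwrap me</p>'

loose = re.compile(r'<p.*?class="delete".*?>(.*?)</p>')
strict = re.compile(r'<p[^>]*class="delete"[^>]*>(.*?)</p>')

# The loose pattern starts at the first <p and swallows the first paragraph.
loose_result = loose.sub(r"\1", html)

# The strict pattern cannot cross a ">" while looking for the class attribute.
strict_result = strict.sub(r"\1", html)

print(repr(loose_result))
print(repr(strict_result))
```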

Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:

<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag
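To give a feel for how quickly this gets ugly, here is a sketch (Python assumed) that runs two simple patterns over the six variants above and then tries a more tolerant class pattern. The `tolerant` pattern is my own illustration, not a complete solution: it accepts unquoted, single-quoted and double-quoted values and a space in the closing tag, but still cannot do anything about a missing closing tag.

```python
import re

samples = [
    "<p>no class</p>",
    "<p class=delete>no quotes</p>",
    '<p class="delete">double quotes</p>',
    "<p class='delete'>single quotes</p>",
    "<p>space in closing tag</p >",
    "<p>no closing tag",
]

any_p = re.compile(r"<p.*?>(.*?)</p>")
class_dq = re.compile(r'<p.*?class="delete".*?>(.*?)</p>')

# Optional quote captured in group 1 and required again after the value;
# the lookahead stops "delete" from matching a longer unquoted value.
tolerant = re.compile(
    r"""<p\b[^>]*\bclass\s*=\s*(["']?)delete\1(?=[\s>])[^>]*>(.*?)</p\s*>""",
    re.IGNORECASE | re.DOTALL,
)

any_p_hits = [bool(any_p.search(s)) for s in samples]
class_dq_hits = [bool(class_dq.search(s)) for s in samples]
tolerant_hits = [bool(tolerant.search(s)) for s in samples]

for sample, a, c, t in zip(samples, any_p_hits, class_dq_hits, tolerant_hits):
    print(f"{sample!r}: any_p={a} class_dq={c} tolerant={t}")
```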
Brandon Gano
Actually, I thought recommending regex to parse HTML was off limits.
Tomalak