ansaurus

Question

Regular expression to delete HTML strings

Answer 1

+3 A:

Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.

Tomalak 2009-06-27 07:49:02
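
For a fixed string, a plain search and replace needs no pattern syntax at all. A minimal Python sketch; the sample markup and the string to remove are made up for illustration:

```python
# Hypothetical input and the fixed string to delete from it.
html = '<p>Hello</p><div class="ad">Buy now</div><p>World</p>'
unwanted = '<div class="ad">Buy now</div>'

# str.replace removes every literal occurrence; no regex engine is involved.
cleaned = html.replace(unwanted, "")
print(cleaned)
```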

I agree. If you can use a string replace function, you will get a performance benefit too.

Stuart 2009-06-27 08:35:50

... and if you want to remove HTML nodes, use an HTML parser.

Svante 2009-06-27 10:42:45
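
The parser suggestion can be sketched with Python's standard-library html.parser. The class name NodeRemover and the helper remove_nodes are hypothetical names chosen for this example, and the sketch assumes the goal is to drop every element with a given tag name together with its content:

```python
from html.parser import HTMLParser


class NodeRemover(HTMLParser):
    """Re-emit the input, skipping every element named `tag` and its content."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.depth = 0  # greater than 0 while inside an element being removed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != self.tag and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def remove_nodes(html, tag):
    parser = NodeRemover(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(remove_nodes("<div><span>x</span><b>keep</b></div>", "span"))
```

This sketch silently drops comments and doctype declarations, and it treats a removed element with a missing closing tag as running to the end of the input, so it is a starting point rather than a complete sanitizer.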

Answer 2

+1 A:

Just match the pattern for the content you want to keep and write the matches to a file, or update the database table with them. That way, you are effectively deleting the rest.

Alan Haggai Alavi 2009-06-27 07:50:15
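
The "keep the matches, discard the rest" idea might look like this in Python. The input and the pattern (link targets) are assumptions made up for the example, since the question does not say what should be kept:

```python
import re

# Hypothetical input: keep only the link targets, discard everything else.
html = (
    '<p>See <a href="https://example.com/a">A</a> and '
    '<a href="https://example.com/b">B</a>.</p>'
)
keep_pattern = re.compile(r'href="(.*?)"')

# findall returns only the captured group of each match.
kept = keep_pattern.findall(html)

# The joined text is what would be written to a file or stored in a table.
output = "\n".join(kept)
print(output)
```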

Answer 3

+1 A:

If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.

These samples will have to be modified based on the tool/framework you're using to handle regular expressions. I am not escaping special characters for brevity.

To match any paragraph tags:

<p.*?>(.*?)</p>

You would replace these matches with $1 (or whatever your syntax requires to access groups).
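
As a rough illustration in Python, where the group reference is written \1 rather than $1, the replacement could look like this. The sample input is made up:

```python
import re

# Hypothetical input with two paragraphs, one carrying an attribute.
html = '<p class="a">Hello</p> and <p>world</p>'

# Non-greedy paragraph pattern from the answer; group 1 is the inner text.
paragraph = re.compile(r"<p.*?>(.*?)</p>")

# Replace each whole match with its captured content, dropping the tags.
stripped = paragraph.sub(r"\1", html)
print(stripped)
```

Note that <p.*?> also matches other tags whose names begin with p, such as <pre>, so a stricter opening like <p\b[^>]*> may be preferable on real input.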

It's important to use non-greedy (?) patterns to avoid accidentally matching two unrelated start/end tags. For example:

<p.*>(.*)</p>

This greedy version would behave very differently. Given the following example HTML, it would not match the two paragraphs separately:

<p>Lorem ipsum.</p><p>Lorem ipsum.</p>

Instead, the opening <p.*> part would match "<p>Lorem ipsum.</p><p>", so replacing the match with $1 would lose the content of the first paragraph.
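
The difference between the greedy and non-greedy forms can be checked directly. A small Python sketch using the example HTML above:

```python
import re

html = "<p>Lorem ipsum.</p><p>Lorem ipsum.</p>"

# Greedy: the extra outer group shows what the opening-tag part swallows.
greedy_open = re.match(r"(<p.*>)(.*)</p>", html).group(1)

# Replacing with the content group: greedy keeps one paragraph's text,
# non-greedy keeps both.
greedy_result = re.sub(r"<p.*>(.*)</p>", r"\1", html)
lazy_result = re.sub(r"<p.*?>(.*?)</p>", r"\1", html)

print(greedy_open)
print(greedy_result)
print(lazy_result)
```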

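If you need to match paragraphs with specific classes, you could use something like this (with [^>]* rather than .*? inside the opening tag, so that the match cannot start at an earlier, unrelated <p> and run across tag boundaries):

<p[^>]*class="delete"[^>]*>(.*?)</p>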
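
For class-specific matching, it matters which characters are allowed between <p and the class attribute. The following Python sketch compares .*? with [^>]* in that position on a made-up input; the variable names are illustrative only:

```python
import re

# Hypothetical input: only the second paragraph has class="delete".
html = '<p>keep</p><p class="delete">drop</p>'

# .*? may cross the first paragraph's ">" to reach the class attribute.
loose = re.compile(r'<p.*?class="delete".*?>(.*?)</p>')

# [^>]* cannot leave the opening tag, so only the second <p> matches.
strict = re.compile(r'<p\b[^>]*class="delete"[^>]*>(.*?)</p>')

loose_result = loose.sub(r"\1", html)
strict_result = strict.sub(r"\1", html)

print(loose_result)
print(strict_result)
```

With the loose pattern the first paragraph is swallowed into the match and lost; the strict pattern leaves it untouched.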

Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:

<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag
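
To give an idea of how ugly such a pattern gets, here is a Python sketch that tolerates the three quoting styles and whitespace in the closing tag. It is an assumption-laden example, not a general solution: it only handles a class attribute whose whole value is delete, and it still cannot handle the missing closing tag. The last sample line is a hypothetical addition that combines a class with a spaced closing tag:

```python
import re

sample = "\n".join([
    "<p>no class</p>",
    "<p class=delete>no quotes</p>",
    '<p class="delete">double quotes</p>',
    "<p class='delete'>single quotes</p>",
    "<p>space in closing tag</p >",
    "<p>no closing tag",
    '<p class="delete">class and spaced closing tag</p >',  # hypothetical
])

# Group 1 captures the optional quote so the same quote must close the value;
# the lookahead stops values such as "deleted" from matching.
tolerant = re.compile(
    r"""<p\b[^>]*\bclass\s*=\s*(["']?)delete\1(?=[\s>])[^>]*>(.*?)</p\s*>"""
)

matched = [m.group(2) for m in tolerant.finditer(sample)]
print(matched)
```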

Brandon Gano 2009-06-27 08:25:56

Actually, I thought recommending regex to parse HTML was off limits.

Tomalak 2009-06-27 10:08:50
