I am trying to delete part of a string that does not match my pattern. For example, in
<SYNC Start=364><P Class=KRCC>
<Font Color=lightpink>abcd
I would like to delete
<P Class=KRCC><Font Color=lightpink>
How do I do that?
Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.
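As a minimal Python sketch of plain search and replace on the sample from the question: it assumes the two tags are separated by a line break, as the sample is shown, so each tag is removed on its own.

```python
# Sample input from the question; the newline between the tags is an assumption.
text = "<SYNC Start=364><P Class=KRCC>\n<Font Color=lightpink>abcd"

# Remove each fixed fragment with str.replace; no regular expression is involved.
for fragment in ("<P Class=KRCC>", "<Font Color=lightpink>"):
    text = text.replace(fragment, "")

print(repr(text))
```

The line break that sat between the two tags is left in place; strip it separately if you do not want it.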
Alternatively, turn the problem around: match the part you want to keep and write only that to a file or database table. Everything outside the match is then discarded.
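A rough Python sketch of this keep-what-matches approach. The `keep` pattern is only a guess at what the asker wants to retain (the SYNC tag plus the trailing text), so treat it as hypothetical.

```python
import re

# Sample input from the question; the newline between the tags is an assumption.
text = "<SYNC Start=364><P Class=KRCC>\n<Font Color=lightpink>abcd"

# Hypothetical "keep" pattern: capture the SYNC tag, skip any further tags,
# then capture the text that follows them.
keep = re.compile(r"(<SYNC Start=\d+>)(?:\s*<[^>]*>)*([^<]*)")

match = keep.search(text)
kept = match.group(1) + match.group(2) if match else ""
print(kept)
```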
If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.
These samples will need adjusting for whatever tool or framework you use to handle regular expressions. For brevity, I am not escaping special characters.
To match any paragraph tags:
<p.*?>(.*?)</p>
You would replace these matches with $1 (or whatever your syntax requires to access groups).
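In Python, for instance, the group is written `\1` rather than `$1`. A small sketch on made-up input (the `id` attribute is only there to show that attributes are tolerated):

```python
import re

# Made-up input: one bare paragraph and one with an attribute.
html = '<p>Lorem ipsum.</p> <p id="x">Dolor sit.</p>'

# Replace each paragraph element with the contents of its first group.
unwrapped = re.sub(r"<p.*?>(.*?)</p>", r"\1", html)
print(unwrapped)
```

Note that `.` does not match line breaks in Python unless you pass `re.DOTALL`, so paragraphs spanning several lines would need that flag.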
It's important to use non-greedy (?) patterns to avoid accidentally matching two unrelated start/end tags. For example:
<p.*>(.*)</p>
would behave very differently. Given the following example HTML, it would not match the two paragraphs separately:
<p>Lorem ipsum.</p><p>Lorem ipsum.</p>
Instead, the opening part of the pattern (<p.*>) would match "<p>Lorem ipsum.</p><p>", so replacing with $1 would lose the content of the first paragraph.
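The difference is easy to check in Python. This sketch uses two paragraphs with different text (my own variation on the example) so the lost content is visible:

```python
import re

html = "<p>First.</p><p>Second.</p>"

# Greedy: the opening part swallows "<p>First.</p><p>", so only the last
# paragraph's text ends up in group 1.
greedy = re.sub(r"<p.*>(.*)</p>", r"\1", html)

# Non-greedy: each paragraph is matched and unwrapped on its own.
lazy = re.sub(r"<p.*?>(.*?)</p>", r"\1", html)

print(greedy)
print(lazy)
```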
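If you need to match paragraphs with specific classes, you could use something like this:
<p[^>]*class="delete"[^>]*>(.*?)</p>
Note the [^>]* inside the opening tag instead of .*?: even a non-greedy .*? can run past the > of an earlier <p> while it looks for the class attribute, which would swallow unrelated paragraphs.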
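A minimal Python sketch on made-up input, comparing a lazy `.*?` with a bounded `[^>]*` inside the opening tag:

```python
import re

html = '<p>keep</p><p class="delete">unwrap me</p>'

# Lazy dot: the match starts at the first <p and stretches across the
# untouched paragraph until it finds class="delete".
lazy_dot = re.sub(r'<p.*?class="delete".*?>(.*?)</p>', r"\1", html)

# Bounded: [^>]* cannot leave the opening tag, so only the paragraph that
# actually carries the class is matched.
bounded = re.sub(r'<p[^>]*class="delete"[^>]*>(.*?)</p>', r"\1", html)

print(lazy_dot)
print(bounded)
```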
Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:
<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag
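To give an idea of how ugly it gets, here is a rough Python sketch of a pattern that tolerates the quoting and spacing variants above. It is my own attempt, not a complete solution: it is applied line by line, and it deliberately leaves the paragraph without a closing tag alone.

```python
import re

lines = [
    "<p>no class</p>",
    "<p class=delete>no quotes</p>",
    '<p class="delete">double quotes</p>',
    "<p class='delete'>single quotes</p>",
    "<p>space in closing tag</p >",
    "<p>no closing tag",
]

# Optional class attribute with no quotes, double quotes or single quotes
# (group 1 remembers which), optional whitespace before either ">".
tolerant = re.compile(
    r"""<p(?:\s+class=(["']?)delete\1)?\s*>(.*?)</p\s*>""",
    re.IGNORECASE,
)

# Group 2 holds the paragraph text; lines that do not match are unchanged.
cleaned = [tolerant.sub(r"\2", line) for line in lines]
for line in cleaned:
    print(line)
```

Once the input is this irregular, a real HTML parser is usually the safer tool.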