I need to find a certain chunk in a group of HTML files and delete it from all of them. The files are badly hacked-up HTML, so instead of parsing them with the HTML Agility Pack, as I was trying before, I would like to use a simple regex.

The section of HTML will always look like this:

    <CENTER>some constant text <img src=image.jpg> more constant text:
    variable section of text</CENTER>

All of the above can be any combination of upper and lower case. Notice that it is img src=image.jpg and not img src="image.jpg". There can also be any number of whitespace characters between the constant characters.

Here are some examples:

    <CENTER>This page has been visited
    <IMG SRC=http://place.com/image.gif ALT="alt text">times since 10th July 2007
    </CENTER>

or

    <center>This page has been visited
    <IMG src="http://place.com/image.gif" Alt="Alt Text">
    times since 1st October 2005</center>

What do you think would be a good way to match this pattern?

+2  A: 

How much of that text is needed to uniquely identify the target? I would try this first:

@"(?is)<center>\s*This\s+page\s+has\s+been\s+visited.*?</center>"
Alan Moore
you read my mind :) Thank you.
Alex Baranosky
Would you mind explaining (?is:)?
Alex Baranosky
Ignore case (i) and single line (s), i.e. don't worry about capitalization, and let the dot match line breaks too.
MarkusQ
I just realized the colon isn't needed when you use it the way I did, so I removed that. Here's a complete explanation: http://www.regular-expressions.info/modifiers.html
Alan Moore
Are you 100% sure this regex will work? It isn't finding any matches, or I am messing something up :)
Alex Baranosky
Oh, I hadn't seen the above comment :)
Alex Baranosky
It works like a charm... so far :) Thanks a lot!
Alex Baranosky
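A minimal sketch of how the pattern from this answer could be applied with Regex.Replace, run against the two sample snippets from the question. The class name VisitCounterDemo, the Strip helper, and the surrounding <p> markup are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisitCounterDemo
{
    // Pattern from the answer: (?i) ignores case, (?s) lets '.' match line breaks.
    public const string Pattern =
        @"(?is)<center>\s*This\s+page\s+has\s+been\s+visited.*?</center>";

    // Returns the input with every matching <center> block removed.
    public static string Strip(string html) => Regex.Replace(html, Pattern, "");

    public static void Main()
    {
        // The two samples from the question, wrapped in hypothetical <p> markup.
        string first = "<p>before</p><CENTER>This page has been visited \n" +
                       "<IMG SRC=http://place.com/image.gif ALT=\"alt text\">times since 10th July 2007\n" +
                       "</CENTER><p>after</p>";
        string second = "<p>before</p><center>This page has been visited \n" +
                        "<IMG src=\"http://place.com/image.gif\" Alt=\"Alt Text\"> \n" +
                        "times since 1st October 2005</center><p>after</p>";

        Console.WriteLine(Strip(first));   // <p>before</p><p>after</p>
        Console.WriteLine(Strip(second));  // <p>before</p><p>after</p>
    }
}
```

The inline (?is) prefix should behave the same as passing RegexOptions.IgnoreCase | RegexOptions.Singleline as a separate argument, so either style can be used.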
+1  A: 

It really depends on how simple you can make the regex while still matching the desired elements.

<center>[^<]+<img[^>]+>[^>]+</center>

Use the case-insensitive flag too (I don't know what C# uses). If you need something more specific, because an img tag can also sit within center tags that should not be matched, then you can start hardcoding phrases as the other answer does.

qpingu
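For what it's worth, in C# the case-insensitive flag is RegexOptions.IgnoreCase. The sketch below applies the pattern from this answer to one of the question's samples and to a made-up, unrelated block to illustrate the caveat about over-matching. The class name TagShapeDemo, the Strip helper, and the "logo" snippet are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagShapeDemo
{
    // Pattern from the answer, with case-insensitive matching switched on.
    public static string Strip(string html) =>
        Regex.Replace(html, @"<center>[^<]+<img[^>]+>[^>]+</center>", "",
                      RegexOptions.IgnoreCase);

    public static void Main()
    {
        // First sample from the question.
        string counter = "<CENTER>This page has been visited \n" +
                         "<IMG SRC=http://place.com/image.gif ALT=\"alt text\">times since 10th July 2007\n" +
                         "</CENTER>";
        // Hypothetical unrelated block with the same tag shape.
        string logo = "<center>Our logo <img src=logo.gif> is shown here</center>";

        Console.WriteLine(Strip(counter).Length); // 0: the counter block is removed
        Console.WriteLine(Strip(logo).Length);    // 0: the unrelated block is removed too
    }
}
```

Because the pattern only looks at the tag structure, any center block containing text, an image, and more text is deleted, which is why anchoring on a fixed phrase may be the safer choice.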
A: 

In C# you could simply use this, assuming that originalHtml contains the whole HTML file.

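    using System.Text.RegularExpressions;

    // Remove every <center>...<img src=...>...</center> block.
    // \s+ allows extra whitespace and [^>]* accepts quoted or unquoted src values.
    string result = Regex.Replace(originalHtml,
                                  @"\s*<center>[^<]*<img\s+src=[^>]*>.*?</center>\s*",
                                  "",
                                  RegexOptions.Singleline | RegexOptions.IgnoreCase);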

The Regex will remove any occurrence of the pattern in the original HTML and return the modified version.

Renaud Bompuis
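Since the question is about a whole group of files, here is a rough sketch of how a replacement like the one above might be run over a folder. The class name CounterStripper, the StripFolder helper, and the "*.html" filter are assumptions, and the pattern is a variant of the one in this answer that also tolerates quoted src values and extra whitespace. It reads and writes files with the default text encoding, which may not suit every file, so trying it on a copy first seems prudent.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class CounterStripper
{
    // Variant of the answer's pattern (assumption): \s+ and [^>]* added for
    // extra whitespace and for quoted or unquoted src values.
    private static readonly Regex Block = new Regex(
        @"\s*<center>[^<]*<img\s+src=[^>]*>.*?</center>\s*",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Rewrites each *.html file directly inside 'folder' whose content changes.
    // Returns the number of files that were modified.
    public static int StripFolder(string folder)
    {
        int changed = 0;
        foreach (string path in Directory.EnumerateFiles(folder, "*.html"))
        {
            string original = File.ReadAllText(path);
            string cleaned = Block.Replace(original, "");
            if (cleaned != original)
            {
                File.WriteAllText(path, cleaned);
                changed++;
            }
        }
        return changed;
    }

    public static void Main(string[] args)
    {
        // Folder comes from the first command-line argument, else the current directory.
        string folder = args.Length > 0 ? args[0] : ".";
        Console.WriteLine($"Modified {StripFolder(folder)} file(s).");
    }
}
```

Files that contain no match are left untouched, so their timestamps do not change.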
A: 

You ought to try RegexBuddy (not free, but inexpensive); this tool has saved me a lot of time.

Hope this helps.

labilbe