I have a set of HTML files that I want to modify by replacing the header and footer. The contents of each file are different, and I would like to use a regular expression (or something similar if regular expressions can't handle multi-line matches).

As an example, one modification I want to make is to replace everything between <html> and </head> with a standard header.

Can this be done with a regular expression? What method would you use to perform a bulk search and replace like this in C#?

Can you provide an example of a regular expression that matches multiple lines?

A: 

Well, the simple answer is yes.

Regex can indeed help you, but you need a tool that copes with multiple files. I can't recommend one at the moment; try Googling "multiple file search and replace". A regex can cope with multi-line or single-line matching.

I use Notepad++, which can do a search/replace across multiple files (open ones, or those within a directory tree). That is not its primary aim, but it works.

The hard part is defining your "match": wherever you need to preserve details from the original text, make sure there is an appropriate capture group that you can refer to in your "replace" expression.

So, again, yes it can help, but your question is very high level.

For the C# part, it's simple once you have your regex defined.

using System.IO;
using System.Text.RegularExpressions;

class Program
{
     static void Main()
     {
          // Comment out everything between the opening HTML tag
          // and the end of the HEAD tag.
          string matchRegex = "<html[^>]*>(.*?)</head>";
          // $1 is the first capture group: the text between the two tags.
          string replaceExpression = "<html> <!-- $1 </head> -->";

          string pattern = "*.html";

          // DirectoryInfo is not IDisposable, so it needs no using block.
          DirectoryInfo di = new DirectoryInfo(".");
          foreach (FileInfo fi in di.GetFiles(pattern))
          {
               string content;
               using ( StreamReader sr = fi.OpenText() )
               {
                    // Read the whole file so the match can cross line breaks.
                    content = sr.ReadToEnd();
               }

               // Treat as single-line so that the match can span
               // several lines.
               string newContent = Regex.Replace(content,
                                                 matchRegex,
                                                 replaceExpression,
                                                 RegexOptions.Singleline);

               // Write out/overwrite your new file here, e.g. with
               // File.WriteAllText(fi.FullName, newContent);
          }
     }
}

You may find this page useful: Finding Comments in Source Code. In it, someone is trying to write a regular expression to match comments, then handle multi-line comments, and so on. It shows the regex thought process. The replace part is easy: put a capture group in and reference the group number or name in the replacement string!
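To make the capture-group idea concrete, here is a minimal sketch that swaps the header for a standard one while carrying the original title across. The helper name `ReplaceHeader`, the sample page, and the "standard header" text are all made up for illustration. It also assumes every file has a `<title>` inside its head; a file without one would simply not match.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeaderSwap
{
    // Hypothetical helper: swaps everything from <html> to </head> for a
    // standard header, keeping the original <title> text.
    public static string ReplaceHeader(string html)
    {
        // (?<title>...) names the group so the replacement can use ${title}.
        const string pattern =
            @"<html[^>]*>.*?<title>(?<title>.*?)</title>.*?</head>";
        // Assumed standard header, for illustration only.
        const string replacement =
            "<html>\n<head>\n<title>${title}</title>\n</head>";

        // Singleline lets '.' cross line breaks; IgnoreCase accepts <HTML>.
        return Regex.Replace(html, pattern, replacement,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string page = "<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n" +
                      "<title>My page</title>\n</head>\n<body>Hi</body>\n</html>";
        Console.WriteLine(ReplaceHeader(page));
    }
}
```

A numbered reference (`$1`) would work just as well here; the named form tends to be easier to read once a pattern has more than one group.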

Ray Hayes
I intend to write some C# code to loop through the collection of HTML files, so I won't be using a text editor for this. Do you have an example of a regular expression that will match over multiple lines?
NickGPS
Thanks for your help. I edited the question as I hadn't encoded the < and > so they got stripped out, making the question a bit ambiguous. I understand how to write a loop; what I'm looking for is an example of a regular expression that can match multiple lines.
NickGPS
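Pass in RegexOptions.Multiline to change the behaviour of ^ and $, or RegexOptions.Singleline to change the behaviour of the dot. Multiline = "Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string." Singleline makes . match every character, including \n, which is what lets a match span several lines.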
Ray Hayes
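As a rough illustration of the two options, the sketch below runs the same input through each one. The class name, method names, and sample text are invented for the demo, and the text deliberately uses \n line endings only.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Sample input, assumed for illustration; uses \n line endings only.
    const string Text = "<head>\nfirst\nsecond\n</head>";

    // Default options: '.' stops at '\n', so nothing spans the lines.
    public static bool DotDefault()
    {
        return Regex.IsMatch(Text, "<head>.*</head>");
    }

    // Singleline: '.' also matches '\n', so the match spans all four lines.
    public static bool DotSingleline()
    {
        return Regex.IsMatch(Text, "<head>.*</head>", RegexOptions.Singleline);
    }

    // Multiline: ^ and $ anchor at each line, so each all-letter line counts.
    public static int LinesMultiline()
    {
        return Regex.Matches(Text, "^[a-z]+$", RegexOptions.Multiline).Count;
    }

    static void Main()
    {
        Console.WriteLine(DotDefault());      // False
        Console.WriteLine(DotSingleline());   // True
        Console.WriteLine(LinesMultiline());  // 2
    }
}
```

The two options are independent and can be combined with `|` when a pattern needs both behaviours.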
I personally use Singleline so that I can have Multiple line captures. I just deal with the \r and \n to handle new lines. E.g. in a Singleline match, looking for "\r\n\r\n" will search for a blank line. Something like "[\r\n]{1,2}" can make it deal with Unix/Windows line endings.
Ray Hayes
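A small sketch of the line-ending point, with an invented helper name and sample strings. Note that it swaps in `\r?\n` rather than the `[\r\n]{1,2}` pattern mentioned above: the character class would also accept two Unix newlines as a single line ending, which matters when the goal is to spot blank lines.

```csharp
using System;
using System.Text.RegularExpressions;

static class LineEndings
{
    // \r?\n matches one Windows (\r\n) or Unix (\n) line ending;
    // two in a row means there is an empty line between two blocks.
    public static string[] SplitOnBlankLines(string text)
    {
        return Regex.Split(text, @"(?:\r?\n){2}");
    }

    static void Main()
    {
        string windows = "header\r\n\r\nbody";
        string unix = "header\n\nbody";
        Console.WriteLine(SplitOnBlankLines(windows).Length); // 2
        Console.WriteLine(SplitOnBlankLines(unix).Length);    // 2
    }
}
```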
Updated example to comment out the HTML between the start-HTML and the end of the HEAD.
Ray Hayes
Perfect, just what I was looking for. I didn't realise you could alter the behaviour of the start and end characters. Thanks again.
NickGPS