I am trying to find XML files that contain large swaths of commented-out XML. I would like to programmatically search for XML comments that span more than a given number of lines. Is there an easy way to do this?

A: 

Considering that XML isn't a line-based format, you should probably check the number of characters instead. With a regular expression, you can write a pattern that matches the comment prefix and then requires a minimum number of characters before it matches the first comment suffix.

http://www.regular-expressions.info/

Here is the pattern that worked in some preliminary tests:

<!-- (.[^-->]|[\r\n][^-->]){5}(.[^-->]|[\r\n][^-->])*? -->

It will match the starting comment prefix and everything after it, including newline characters (on a Windows OS), and it's lazy, so it will stop at the first comment suffix.
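If a line threshold is still what you want, one option is to let the regex find each comment lazily and do the counting in code. Below is a minimal Python sketch of that idea (Python is my choice here, not what was used in the tests above). The names COMMENT_RE and comments_longer_than are made up for illustration, and it assumes the file has no CDATA sections containing "<!--", which would show up as false positives.

```python
import re

# Lazy match from a comment opener to the nearest closer.
# DOTALL lets "." cross line breaks.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def comments_longer_than(text, max_lines):
    """Return (start_line, line_count) for each comment spanning more than max_lines lines."""
    results = []
    for match in COMMENT_RE.finditer(text):
        # Lines spanned by the comment, delimiters included.
        line_count = match.group(0).count("\n") + 1
        if line_count > max_lines:
            # 1-based line number on which the comment opens.
            start_line = text.count("\n", 0, match.start()) + 1
            results.append((start_line, line_count))
    return results

sample = "<root>\n<!-- short -->\n<!--\n<a/>\n<b/>\n<c/>\n-->\n</root>\n"
print(comments_longer_than(sample, 3))  # [(3, 5)]
```

Counting in code keeps the pattern simple, and the threshold can be changed without touching the regex.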

Sorry for the edits. You are correct; here is an updated pattern. It's obviously not optimized, but in some tests it seems to resolve the error you pointed out.

Sam
A: 

You mean something like this?

/<!--.{100}.*?-->/

That won't work when there are multiple comments, as it will skip till it finds the first end comment that is beyond the character count.
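To make the overrun concrete, here is a small Python check (a sketch, not the tool anyone in this thread used). The sample text is invented: two short comments with 100 characters of ordinary content between them. DOTALL is added so that "." can also cross line breaks.

```python
import re

# The pattern from this answer, with DOTALL added so "." can cross line breaks.
naive = re.compile(r"<!--.{100}.*?-->", re.DOTALL)

# Two short comments separated by 100 characters of ordinary content.
text = "<!-- a -->" + "x" * 100 + "<!-- b -->"

match = naive.search(text)
# The fixed 100-character run swallows the first "-->",
# so the match only ends at the second comment's closer.
print(match.group(0) == text)  # True
```

Neither comment is anywhere near 100 characters long, yet the whole string is reported as one match.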

A: 

Thank you for the quick responses, but I'm not convinced that'll work exactly either. For a regex solution I think we need a negative lookahead, checked before every counted character so the match can never run past the end of a comment:

/<!--(?:(?!-->).){100,}-->/s
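One way to keep the counted characters inside a single comment is to test a negative lookahead before each character. Here is a minimal Python sketch of that approach. The helper name long_comment_pattern and both sample strings are hypothetical, and the 100-character minimum is just the figure used in this thread.

```python
import re

def long_comment_pattern(min_chars):
    """Build a pattern for comments with at least min_chars characters of content."""
    # (?:(?!-->).) consumes one character only if "-->" does not start there,
    # so the counted run can never cross the end of a comment.
    return re.compile(r"<!--(?:(?!-->).){%d,}-->" % min_chars, re.DOTALL)

pattern = long_comment_pattern(100)

# Two short comments with 100 characters of ordinary content between them.
short_pair = "<!-- a -->" + "x" * 100 + "<!-- b -->"
# One comment with 150 characters of content.
long_one = "<!--" + "y" * 150 + "-->"

print(pattern.search(short_pair))            # None
print(pattern.search(long_one) is not None)  # True
```

The per-character lookahead is slower than a plain ".*?", but it only reports comments whose own content reaches the minimum.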

I'm curious, what are you using to execute the regex?

A: 

I'm using this application to test the regex:

http://www.regular-expressions.info/dotnetexample.html

I have tested it against some fairly good data and it seems to be pulling out only the commented section.

Sam
A: 

Hmm, I'm stuck in Windows-land for now. Cygwin grep doesn't support Perl regexes, and the Visual Studio regex search doesn't like either of our regexes. I'll see what I can do later when I get hold of a Unix box.

Hi Chris, using .NET should work fine, since that application is written in C#. If you download the files, the source code is included...
Sam
A: 

I'm not sure about the number of lines, but if you can use the length of the string, here's something that works using XPath.

using System;
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        string[] myFiles = { @"C:\temp\XMLFile1.xml", 
                             @"C:\temp\XMLFile2.xml", 
                             @"C:\temp\XMLFile3.xml" };
        int maxSize = 5; // report comments longer than this many characters
        foreach (string file in myFiles)
        {
            XPathDocument myDoc = new XPathDocument(file);
            XPathNavigator myNav = myDoc.CreateNavigator();

            // Select every comment node in the document.
            XPathNodeIterator nodes = myNav.Select("//comment()");
            while (nodes.MoveNext())
            {
                // Value is the comment text without the <!-- and --> delimiters.
                string comment = nodes.Current.Value;
                if (comment.Length > maxSize)
                    Console.WriteLine(file + ": Long comment length = " + 
                      comment.Length);
            }
        }

        Console.ReadLine(); // keep the console window open
    }
}
Mattio
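
For anyone not on .NET, the same parser-based idea can be sketched in Python with the standard library's xml.dom.minidom, which keeps comment nodes in the tree. This is only an illustration under my own assumptions: the function name long_comments and the sample document are invented, and it takes the XML as a string rather than looping over files.

```python
from xml.dom import minidom

def long_comments(xml_text, max_chars):
    """Return the text of every comment longer than max_chars characters."""
    found = []

    def walk(node):
        # node.data is the comment text without the <!-- and --> delimiters.
        if node.nodeType == node.COMMENT_NODE and len(node.data) > max_chars:
            found.append(node.data)
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found

sample = "<root><!--hi--><item><!-- a much longer comment --></item></root>"
print(long_comments(sample, 5))  # [' a much longer comment ']
```

Because a real parser finds the comments, text that merely looks like a comment inside a CDATA section is not reported. To test against a line threshold instead, compare node.data.count("\n") + 1 with the limit.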