I am attempting to clean up some dodgy XML attributes with regular expressions.

My input string is this:

<TD X:NUM class=xl101P24_2>I Want to send a FAX:but not </TD>

My intended output string is this:

<TD class=xl101P24_2>I Want to send a FAX:but not </TD>

My code now looks like this:

public static Regex regex1 = new Regex(
      "<\\w*\\s*(X:\\w*)",
    RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );

public void doRegex()
{
    string InputText = @"<TD X:NUM class=xl101P24_2>I Want to send a FAX:but not </TD>";

    string result = regex1.Replace(InputText,"");

    //result now = " class=xl101P24_2>I Want to send a FAX:but not </TD>"
}

So I need to do the replace, but I only want to replace the numbered sub-match, i.e. the 'X:NUM'. How do I do this?

Michael

+4  A: 

You should use a look-behind construct (match the prefix but exclude it from the match). This way, the first part (the "<TD " part) will not be matched and so will not be replaced:

"(?<=<\\w*)\\s*(X:\\w*)"
Philippe Leybaert
Fantastic, that's it. For reference, the final pattern is "(?<=<\\w*\\s*)(X:\\w*)"
Michael Dausmann
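For anyone who wants to try the look-behind pattern from the answer above, here is a minimal sketch using the input string from the question. The class name LookBehindDemo and the method StripXAttribute are made up for illustration. It relies on .NET accepting variable-length look-behind such as (?<=<\w*), which many other regex engines reject.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookBehindDemo
{
    // (?<=<\w*) asserts that "<" plus the tag name precedes the match
    // without being part of the match, so Replace leaves it untouched.
    private static readonly Regex XAttribute = new Regex(
        @"(?<=<\w*)\s*(X:\w*)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: removes the X: attribute that follows a tag name.
    public static string StripXAttribute(string input)
    {
        return XAttribute.Replace(input, "");
    }

    public static void Main()
    {
        string input = "<TD X:NUM class=xl101P24_2>I Want to send a FAX:but not </TD>";

        // prints: <TD class=xl101P24_2>I Want to send a FAX:but not </TD>
        Console.WriteLine(StripXAttribute(input));
    }
}
```

The "FAX:but" in the cell text is left alone because it is not preceded by "<" plus a run of word characters, so the look-behind fails there.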
A: 

Here is the regex way to do it. I wonder why you don't do it using XSL or XML parsing (remove the attribute) :-)

public static Regex regex1 = new Regex(
      "^<\\w*\\s*td\\w*\\s*(X:\\w*)",
    RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );


or "^<\\w*\\s*td\\w*\\s*(X:\\w*)"
Ratnesh Maurya
I can't use XML parsing because the attribute is not well formed. I am trying to clean up the stupid raw text so I CAN parse it as XML.
Michael Dausmann
A: 

Another way to achieve this is to use a replacement string that replaces the whole match with only the first group, ignoring the second group containing the crap.

string sResult = Regex.Replace(sInput, @"(<\w*\s*)(X:\w*\s*)", "$1");

This does not require any look-behinds and so should be quicker (a simple run showed it to be an order of magnitude quicker).

Changing the regex to have a + after the second group will remove every X: attribute that follows the tag name in a row, not only the first one (if this is relevant).

string sResult = Regex.Replace(sInput, @"(<\w*\s*)(X:\w*\s*)+", "$1");
Stevo3000
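The group-replacement approach above can be sketched as a small program. The class name GroupReplaceDemo, the method StripXAttributes and the second input (with two X: attributes in a row) are invented for illustration and are not from the question. Regex.Replace already processes every match in the string, so the + only matters when several X: attributes sit next to each other in one tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupReplaceDemo
{
    // Group 1 captures "<", the tag name and the whitespace after it.
    // Group 2 (repeated by +) consumes each X: attribute and its trailing whitespace.
    // Replacing the whole match with "$1" keeps group 1 and drops the rest.
    public static string StripXAttributes(string input)
    {
        return Regex.Replace(input, @"(<\w*\s*)(X:\w*\s*)+", "$1");
    }

    public static void Main()
    {
        // Input from the question.
        string single = "<TD X:NUM class=xl101P24_2>I Want to send a FAX:but not </TD>";
        // Hypothetical input with two consecutive X: attributes.
        string repeated = "<TD X:NUM X:STR class=xl101P24_2>I Want to send a FAX:but not </TD>";

        // Both lines print: <TD class=xl101P24_2>I Want to send a FAX:but not </TD>
        Console.WriteLine(StripXAttributes(single));
        Console.WriteLine(StripXAttributes(repeated));
    }
}
```

Because group 2 also consumes the whitespace after each attribute, the result keeps a single space between the tag name and the next attribute.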