I have the following string and I would like to remove <bpt *>*</bpt> and <ept *>*</ept> (notice the additional tag content inside them that also needs to be removed) without using an XML parser (the overhead is too large for tiny strings).

The big <bpt i="1" x="1" type="bold"><b></bpt>black<ept i="1"></b></ept> <bpt i="2" x="2" type="ulined"><u></bpt>cat<ept i="2"></u></ept> sleeps.

Any regex in VB.NET or C# will do.

+1  A: 

I presume you want to drop the tag entirely?

(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)

The ? after the * makes it non-greedy, so it will try to match as few characters as possible.

One problem you'll have is nested tags. If one bpt element sits inside another, the lazy match stops at the first closing </bpt> it reaches, so the outer closing tag would be left behind.
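
To make the lazy matching and the nesting caveat concrete, here is a minimal C# sketch. The class name LazyTagStrip, the helper Strip and the nested input are made up for illustration; only the pattern comes from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyTagStrip
{
    // The alternation from the answer above. Each .*? is lazy, so a branch
    // ends at the first closing tag it can reach.
    const string Pattern = @"(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)";

    // Hypothetical helper: remove every match of the pattern.
    public static string Strip(string input) => Regex.Replace(input, Pattern, "");

    static void Main()
    {
        string sample =
            "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
            "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";
        Console.WriteLine(Strip(sample)); // The big black cat sleeps.

        // Made-up nested input: the match ends at the inner </bpt>,
        // so the outer closing tag survives.
        Console.WriteLine(Strip("a<bpt i=\"1\">x<bpt i=\"2\">y</bpt>z</bpt>b")); // az</bpt>b
    }
}
```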

davenpcj
A: 

Does the .NET regex engine support negative lookaheads? If yes, then you can use

(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)

This turns the string above into "The big black cat sleeps." if you remove all matches. However, keep in mind that it will not work if you have nested bpt/ept elements. You might also want to add \s in some places to allow for extra whitespace in closing elements etc.
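
For what it's worth, the .NET engine does support negative lookaheads, so the pattern can be used as written. A minimal sketch, with the class name LookaheadTagStrip and the helper Strip invented for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadTagStrip
{
    // Pattern from the answer above. Group 2 captures "e" or "b";
    // ((?!</\2pt>).)+ consumes one character at a time for as long as the
    // matching closing tag does not start at the current position.
    static readonly Regex Tag = new Regex(@"(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)");

    // Hypothetical helper: remove every match of the pattern.
    public static string Strip(string input) => Tag.Replace(input, "");

    static void Main()
    {
        string sample =
            "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
            "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";
        Console.WriteLine(Strip(sample)); // The big black cat sleeps.
    }
}
```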

Torsten Marek
+4  A: 

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

EDIT:

I decided to add on to my solution with a better option. The previous option would not work if there were embedded tags. This new solution should strip all <*pt> tags, embedded or not. In addition, this solution uses a back reference to the original [be] match so that the exact matching end tag is found. This solution also creates a reusable Regex object for improved performance so that each iteration does not have to recompile the Regex:

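// Requires: using System.Text.RegularExpressions;
try {
    // Build the Regex once so it is not re-parsed on every pass.
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
    // Keep replacing until no complete <bpt>/<ept> element is left.
    while (regex.IsMatch(yourstring)) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException) {
    // Syntax error in the regular expression
}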

ADDITIONAL NOTES:

In the comments a user expressed worry that the '.' pattern matcher would be CPU-intensive. While this is true of a standalone greedy '.', the non-greedy modifier '?' makes the regex engine look ahead only until it finds the first match of the next character in the pattern, whereas a greedy '.' requires the engine to look ahead all the way to the end of the string. I use RegexBuddy as a regex development tool, and it includes a debugger that lets you see the relative performance of different regex patterns. It can also auto-comment your regexes, so I decided to include those comments here to explain the regex used above:

// <([be])pt[^>]+>.+?</\1pt>
// 
// Match the character "<" literally «<»
// Match the regular expression below and capture its match into backreference number 1 «([be])»
//    Match a single character present in the list "be" «[be]»
// Match the characters "pt" literally «pt»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the character ">" literally «>»
// Match any single character that is not a line break character «.+?»
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters "</" literally «</»
// Match the same text as most recently matched by backreference number 1 «\1»
// Match the characters "pt>" literally «pt>»
tyshock
Nice one, except for the use of ".", which is pretty CPU-intensive; that matters if you process a big XML file. You could just replace it with "[^<>]", couldn't you?
e-satis
Sorry, because of the sub-tags, you just can't. Better to use "[^ø]" instead.
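
A quick C# sketch of why "[^<>]" cannot stand in for "." here: the content between the tags itself contains "<" and ">". The class ContentClassCheck and its Strip helper are made up for illustration, and the sample is a shortened form of the question's string.

```csharp
using System;
using System.Text.RegularExpressions;

static class ContentClassCheck
{
    public const string Sample =
        "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> sleeps.";

    // Hypothetical helper: the answer's pattern with "." swapped for the
    // given character class.
    public static string Strip(string input, string contentClass) =>
        Regex.Replace(input, @"<([be])pt[^>]+>" + contentClass + @"+?</\1pt>", "");

    static void Main()
    {
        // "[^<>]" can never get past the "<b>" inside the element,
        // so nothing matches and the string comes back unchanged.
        Console.WriteLine(Strip(Sample, "[^<>]"));

        // "[^ø]" accepts "<" and ">" (and line breaks), so the elements go.
        Console.WriteLine(Strip(Sample, "[^ø]")); // The big black sleeps.
    }
}
```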
e-satis
+1  A: 

Why do you say the overhead is too large? Did you measure it? Or are you guessing?

Using a regex instead of a proper parser is a shortcut that you may run afoul of when someone comes along with something like <bpt foo="bar>">

Andy Lester
Well, using a regex or some other crutch is the only thing you can do when you have non-wellformed XML. The markup in the question is not XML, it has intersecting hierarchies.
Torsten Marek
A: 

If you're going to use a regex to remove XML elements, you'd better be sure that your input XML doesn't use elements from different namespaces, or contain CDATA sections whose content you don't want to modify.

The proper (i.e. both performant and correct) way to do this is with XSLT. An XSLT transform that copies everything except a specific element to the output is a trivial extension of the identity transform. Once the transform is compiled it will execute extremely quickly. And it won't contain any hidden defects.
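
A rough sketch of what that could look like from C#, assuming the input is well-formed XML (the string in the question is not, so the sample below is a made-up, escaped variant). The class name IdentityMinusTags, the helper Strip and the seg wrapper element are illustrative only.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class IdentityMinusTags
{
    // Identity transform plus one empty template that drops bpt and ept.
    const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
  <xsl:template match=""bpt|ept""/>
</xsl:stylesheet>";

    // Compiled once and reused for every call.
    static readonly XslCompiledTransform Xslt = Compile();

    static XslCompiledTransform Compile()
    {
        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    // Hypothetical helper: run the transform over an XML string.
    public static string Strip(string xml)
    {
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            Xslt.Transform(input, null, output);
            return output.ToString();
        }
    }

    static void Main()
    {
        // Made-up well-formed input: the inner markup is escaped.
        Console.WriteLine(Strip(
            "<seg>The big <bpt i=\"1\">&lt;b&gt;</bpt>black<ept i=\"1\">&lt;/b&gt;</ept> cat sleeps.</seg>"));
    }
}
```

The empty template wins over the identity template for bpt and ept because a name test has a higher default priority than node().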

Robert Rossney
A: 

Is there any possible way to get a global solution of the regex pattern for XML type of text? That way I'll get rid of the Replace function and shall use the regex. The trouble is to analyze whether the < and > come in order or not, and also replacing reserved chars such as ' and & and so on. Here is the code:

'handling special chars functions
Friend Function ReplaceSpecChars(ByVal str As String) As String
  Dim arrLessThan As New Collection
  Dim arrGreaterThan As New Collection
  If Not IsDBNull(str) Then
    str = CStr(str)
    If Len(str) > 0 Then
      str = Replace(str, "&", "&amp;")
      str = Replace(str, "'", "&apos;")
      str = Replace(str, """", "&quot;")
      arrLessThan = FindLocationOfChar("<", str)
      arrGreaterThan = FindLocationOfChar(">", str)
      str = ChangeGreaterLess(arrLessThan, arrGreaterThan, str)
      str = Replace(str, Chr(13), "chr(13)")
      str = Replace(str, Chr(10), "chr(10)")
    End If
    Return str
  Else
    Return ""
  End If
End Function

Friend Function ChangeGreaterLess(ByVal lh As Collection, ByVal gr As Collection, ByVal str As String) As String
  For i As Integer = 0 To lh.Count
    If CInt(lh.Item(i)) > CInt(gr.Item(i)) Then
      str = Replace(str, "<", "<") '/////////problems////
    End If
  Next

  str = Replace(str, ">", "&gt;")
End Function

Friend Function FindLocationOfChar(ByVal chr As Char, ByVal str As String) As Collection
  Dim arr As New Collection
  For i As Integer = 1 To str.Length() - 1
    If str.ToCharArray(i, 1) = chr Then
      arr.Add(i)
    End If
  Next
  Return arr
End Function

I got trouble at the line marked "problems".

That's standard XML with different tags that I want to analyse.

A: 

Have you measured this? I have run into performance issues using .NET's regex engine, but by contrast have parsed XML files of around 40GB without issue using the XML parser (you will need to use XmlReader for larger strings, however).

Please post an actual code sample and mention your performance requirements: I doubt the Regex class is the best solution here if performance matters.

Eamon Nerbonne