I'm having a problem matching non-HTML tags in text, mainly because the tags start with &lt; and end with &gt; rather than < and >. So instead of <ref>xxx</ref> I have &lt;ref&gt;xxx&lt;/ref&gt;. What I need to do is remove all such tags, including their content.

The problem is that some tags may have attributes. I found a nice answer here, but there's still a problem.

Assuming that I have a tag like <gallery src=sss>xxx</gallery>, this expression suits perfectly:

@"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>"

Reality is quite different: all special characters are escaped, so the tag looks like &lt;gallery src=sss&gt;xxx&lt;/gallery&gt;. My problem is matching this kind of tag. So far I have this expression: @"\&lt\;(?<Tag>\w+)[^\&)]*\&gt\;.*?\&lt\;/\k<Tag>\&gt\;". It matches tags with no attributes, but not the one mentioned above. What am I missing?

The second issue is matching {| |} tags, because they can be nested. Can you help me with this as well? This expression doesn't do the job: @"\{\|(?:[^\|\}]|\{\|[^\|\}]*\|\})*\|\}"

Edit: To clarify the second issue: I have to match strings that start with the opening tag {|, followed by some text, and end with the closing tag |}. This structure can be nested, so this: {| xxx {| yyy |} xxx |} is allowed. Unfortunately I don't know the maximum nesting level, but let's say that 1 should suit most cases.


Edit 2: This expression works for my first issue: @"\&lt\;(?<Tag>\w+).*?\&lt\;/\k<Tag>\&gt\;". I have noticed that it fails if there's a newline between the opening and closing tags.

Edit 3: This does the job for the second issue: @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}"
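
For readers unfamiliar with .NET balancing groups, here is a minimal sketch of how the Edit 3 pattern can be applied. The class and method names (NestedBlockStripper, Strip) are hypothetical and not from the thread. Each inner {| pushes onto the group N, each inner |} pops it, and (?(N)(?!)) rejects the match while anything is still on the stack. Note that "." does not match a newline unless RegexOptions.Singleline is also passed, so this sketch assumes single-line input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedBlockStripper
{
    // Pattern copied from Edit 3 of the question.
    //   (?!\{\||\|\}).   any character that does not start "{|" or "|}"
    //   \{\|(?<N>)       inner opening delimiter: push onto stack N
    //   \|\}(?<-N>)      inner closing delimiter: pop from stack N
    //   (?(N)(?!))       fail if an opening delimiter is still unmatched
    private const string Pattern =
        @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}";

    // Hypothetical helper: removes every balanced {| ... |} block from the input.
    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        string text = "a {| xxx {| yyy |} xxx |} b";
        Console.WriteLine(Strip(text));
    }
}
```

Because the nesting depth is tracked by the stack, this should not need a fixed maximum nesting level.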

Regards, Ventus

A: 

So you have HTML-escaped text in which you want to find elements? Why not just unescape it first and then use the code you already have? You can use HttpServerUtility.HtmlDecode() for that.

Edit: try this then:

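string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
string previous;
do // repeat until a pass removes nothing, so a leftover "&lt;" cannot loop forever
{
    previous = text;
    text = Regex.Replace(text, "&lt;\\w+.*?&lt;/\\w+&gt;", "");
} while (text != previous);
Console.WriteLine(text);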

In case it is confusing: the loop is for nested tags. You could handle them with a regex too, but that gets complicated.

liho1eye
Because then these tags would be indistinguishable from real tags.
Aillyn
A: 

This regex should (partially) work:

@"&lt;.+?&gt;(.*?)&lt;/.+?&gt;"

That being said, regex is not an appropriate tool for parsing (X)HTML. Here's a better solution:

  1. Add an identifier after the &lt;, e.g. BOGUS000: YourStr.Replace("&lt;", "&lt;BOGUS000")
  2. Now convert the &lt; and &gt; to < and > using HttpServerUtility.HtmlDecode()
  3. Parse the file using an XML parser
  4. Now you know all elements which have a name starting with your identifier (here BOGUS000) are, well, bogus. They can be removed.
  5. Profit ! :)

I am not sure I understand your second issue.

Aillyn
This expression will fail when tags are nested.
Ventus
No, it just won't remove all of them, but since ".*?" will ensure that only the innermost tag is matched, you can just execute it multiple times until no tags are left.
liho1eye
@Ventus Regex is not an appropriate tool to parse (X)HTML, this happens to be the same thing, but with different opening and closing tags. You do what you can.
Aillyn
@Aillyn I'm not sure that this is the same as (X)HTML, because the document is not a structure. It's just text containing some tags, not tags containing text...
Ventus
@Ventus it does appear to be an XML fragment though, which is beyond the scope of a regular expression to fully describe.
Rex M
@Ventus I've updated my answer
Aillyn
@irkz: `.*?` will NOT ensure that the innermost tag is matched. Apply the regex `<(\w+)>.*?</\1>` to `<A><A></A></A>` and you'll see it matches from the first opening tag to the first closing tag, leaving the second `</A>` hanging.
Alan Moore
A: 

Add RegexOptions.Singleline to the Regex.Replace() call (yes, I know, it feels backward) to address the issue with tags that span multiple lines not matching.

Second issue: how is it not exactly the same problem? The regex is given to you; just substitute the bounding strings and you're done.
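
A minimal sketch of that suggestion, using the pattern from Edit 2 of the question (with the redundant backslashes before & and ; dropped, which should not change its meaning). The class and method names (EscapedTagStripper, Strip) are hypothetical. With RegexOptions.Singleline, "." also matches "\n", so the lazy ".*?" can cross line breaks between the opening and closing tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedTagStripper
{
    // Escaped opening tag, any content, then the escaped closing tag with the same name.
    private const string Pattern = @"&lt;(?<Tag>\w+).*?&lt;/\k<Tag>&gt;";

    // Hypothetical helper: removes escaped tags together with their content.
    public static string Strip(string input)
    {
        // Singleline makes "." match newline characters as well.
        return Regex.Replace(input, Pattern, "", RegexOptions.Singleline);
    }

    public static void Main()
    {
        string text = "before &lt;gallery src=sss&gt;line 1\nline 2&lt;/gallery&gt; after";
        Console.WriteLine(Strip(text));
    }
}
```

Without the option, the same call would leave the multi-line gallery element in place, because "." stops at the line break.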

liho1eye
The second one is not the same as the first one. The first was `<tagName>content</tagName>`, but the second is `{| content |}`. The problem is that the second one can also look like this: `{| content {| nested content |} {| another nested content |} content |}`. For me it is totally different from the first one.
Ventus
No, it is the same. You have a blob of text which may contain things like `text[opening sequence]other text[opening sequence]more text[closing sequence]and even more text[closing sequence]text yet again`. The `[opening sequence]` and `[closing sequence]` vary, but the algorithm for resolving them is exactly the same.
liho1eye
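
One way to read the "same algorithm" point in the last comment is a delimiter-agnostic scanner that tracks nesting depth instead of using a regex. The sketch below is an illustration only: DelimitedBlockRemover and Remove are hypothetical names, and nobody in the thread posted this code. It assumes fixed opening and closing strings, so it fits {| and |} directly, while escaped tags with attributes would still need pattern matching for the opening sequence.

```csharp
using System;
using System.Text;

public static class DelimitedBlockRemover
{
    // Hypothetical helper: drops every region between `open` and its matching
    // `close`, however deeply the regions are nested.
    public static string Remove(string text, string open, string close)
    {
        var result = new StringBuilder();
        int depth = 0; // number of currently unclosed opening sequences
        int i = 0;
        while (i < text.Length)
        {
            if (string.CompareOrdinal(text, i, open, 0, open.Length) == 0)
            {
                depth++;
                i += open.Length;
            }
            else if (depth > 0 && string.CompareOrdinal(text, i, close, 0, close.Length) == 0)
            {
                depth--;
                i += close.Length;
            }
            else
            {
                // Keep characters only while outside every block.
                if (depth == 0) result.Append(text[i]);
                i++;
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string text = "a {| content {| nested |} {| another |} content |} b";
        Console.WriteLine(Remove(text, "{|", "|}"));
    }
}
```

A caveat of this sketch: if an opening sequence is never closed, everything after it is dropped, and a closing sequence with no opener is kept as ordinary text.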