ansaurus

Question

Answer 1

A:

So you have HTML-escaped text in which you want to find elements? Why not just unescape it first and then use the code you already have? You can use HttpServerUtility.HtmlDecode() for that.

Edit: try this then:

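string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
string previous;
do  // repeat until a pass removes nothing, so a leftover "&lt;" cannot loop forever
{
    previous = text;
    text = Regex.Replace(text, "&lt;\\w+.*?&lt;/\\w+&gt;", "");
} while (text != previous);
Console.WriteLine(text);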

In case it is confusing: the loop is for nested tags. You could handle them with a single regex too, but that gets complicated.

liho1eye 2010-08-08 15:29:54

Because then these tags would be indistinguishable from real tags.

Aillyn 2010-08-08 15:31:59

Answer 2

A:

This regex should (partially) work:

@"&lt;.+?&gt;(.*?)&lt;/.+?&gt;"

That being said, regex is not an appropriate tool for parsing (X)HTML. Here's a better solution:

1. Add an identifier after the &lt;, e.g. BOGUS000: YourStr.Replace("&lt;", "&lt;BOGUS000")
2. Now convert the &lt; and &gt; to < and > using HttpServerUtility.HtmlDecode()
3. Parse the file using an XML parser
4. Now you know that all elements whose names start with your identifier (here BOGUS000) are, well, bogus. They can be removed.
5. Profit! :)
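
The steps above can be sketched roughly as follows. This is only a sketch under assumptions that are not in the answer: the decoded text must be well-formed XML (quoted attributes, properly nested tags, no other escaped entities), `WebUtility.HtmlDecode` stands in for `HttpServerUtility.HtmlDecode()` so that it runs outside ASP.NET, and the marker is also inserted after the slash of closing tags so that opening and closing names still match. `BogusTagRemover`, `Strip`, and the `root` wrapper element are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BogusTagRemover
{
    // Hypothetical marker; it must not occur in any real tag name.
    private const string Marker = "BOGUS000";

    public static string Strip(string text)
    {
        // 1. Mark escaped tags, placing the marker after an optional slash
        //    so that opening and closing names still match.
        string marked = Regex.Replace(text, @"&lt;(/?)(?=\w)", "&lt;${1}" + Marker);

        // 2. Decode the entities; escaped tags become real, marked tags.
        string decoded = WebUtility.HtmlDecode(marked);

        // 3. Parse as XML; a wrapper element supplies the single root
        //    (assumption: the decoded text is well-formed).
        XElement root = XElement.Parse("<root>" + decoded + "</root>",
                                       LoadOptions.PreserveWhitespace);

        // 4. Remove every element whose name starts with the marker.
        var bogus = root.Descendants()
                        .Where(e => e.Name.LocalName.StartsWith(Marker, StringComparison.Ordinal))
                        .ToList();
        foreach (XElement element in bogus)
        {
            element.Remove();
        }

        // Serialize what is left, without the wrapper element.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        // The attribute is quoted here, unlike the sample in the thread,
        // because an XML parser rejects unquoted attribute values.
        string input = "PLAIN-TEXT&lt;gallery src=\"sss\"&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
        Console.WriteLine(Strip(input));
    }
}
```

Note that the sample from the question uses `src=sss` without quotes; input like that would need cleaning up, or a more lenient parser, before step 3.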

I am not sure I understand your second issue.

Aillyn 2010-08-08 15:46:26

This expression will fail when tags are nested.

Ventus 2010-08-08 16:02:37

No, it just won't remove all of them. But since ".*?" ensures that only the innermost tag is matched, you can just execute it multiple times until no tags are left.

liho1eye 2010-08-08 16:06:45

@Ventus Regex is not an appropriate tool to parse (X)HTML; this happens to be the same thing, but with different opening and closing tags. You do what you can.

Aillyn 2010-08-08 16:09:43

@Aillyn I'm not sure that this is the same as (X)HTML, because the document is not a structure. It's just text containing some tags, not tags containing text...

Ventus 2010-08-08 16:14:31

@Ventus it does appear to be an XML fragment though, which is beyond the scope of a regular expression to fully describe.

Rex M 2010-08-08 16:18:12

@Ventus I've updated my answer

Aillyn 2010-08-08 16:22:12

@irkz: `.*?` will NOT ensure that the innermost tag is matched. Apply the regex `<(\w+)>.*?</\1>` to `<A><A></A></A>` and you'll see it matches from the first opening tag to the first closing tag, leaving the second `</A>` hanging.

Alan Moore 2010-08-09 05:37:48
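
Alan Moore's point above is easy to check. The sketch below runs his pattern against his input, then a variant with a negative lookahead that forbids another opening tag inside the body, which is one way to really match the innermost pair. The second pattern and the names `LazyMatchDemo` and `FirstMatch` are illustrative additions, not something proposed in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Returns the text of the first match, or "" when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "<A><A></A></A>";

        // Lazy quantifier only: the match starts at the first opening tag
        // and stops at the first closing tag.
        Console.WriteLine(FirstMatch(input, @"<(\w+)>.*?</\1>"));

        // Body may not contain an opening tag: only an innermost pair matches.
        Console.WriteLine(FirstMatch(input, @"<(\w+)>(?:(?!<\w+>).)*?</\1>"));
    }
}
```

The first call prints `<A><A></A>` and the second prints `<A></A>`. Removing the first match would leave a dangling `</A>` behind, which is why repeating the lazy pattern is not enough on its own.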

Answer 3

A:

Add RegexOptions.Singleline to the Regex.Replace() call (yes, I know, it feels backward) to address the issue of tags that span multiple lines not matching.
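
A small sketch of the difference, using the pattern from the first answer. By default `.` does not match a newline, so an element whose content spans two lines is left alone. With `RegexOptions.Singleline` it is removed. The sample text and the names `SinglelineDemo` and `Strip` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the first answer.
    private const string Pattern = "&lt;\\w+.*?&lt;/\\w+&gt;";

    public static string Strip(string text, RegexOptions options)
    {
        return Regex.Replace(text, Pattern, "", options);
    }

    public static void Main()
    {
        string text = "a&lt;gallery&gt;line1\nline2&lt;/gallery&gt;b";

        // Default options: "." stops at the newline, so nothing is removed.
        Console.WriteLine(Strip(text, RegexOptions.None) == text);

        // Singleline: "." also matches "\n", so the whole element goes.
        Console.WriteLine(Strip(text, RegexOptions.Singleline));
    }
}
```

This prints `True` and then `ab`.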

Second issue: how is it not exactly the same problem? The regex is given to you; just substitute the bounding strings and you're done.

liho1eye 2010-08-08 19:13:10

The second one is not the same as the first one. The first was `<tagName>content</tagName>`, but the second is `{| content |}`. The problem is that the second one can also look like this: `{| content {| nested content |} {| another nested content |} content |}`. To me that is totally different from the first one.

Ventus 2010-08-08 19:20:37

No, it is the same. You have a blob of text which may contain things like `text[opening sequence]other text[opening sequence]more text[closing sequence]and even more text[closing sequence]text yet again`. The `[opening sequence]` and `[closing sequence]` vary, but the algorithm for resolving them is exactly the same.

liho1eye 2010-08-08 19:26:22
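
One way to read "the algorithm is the same" for fixed delimiters such as `{|` and `|}` is sketched below: remove innermost blocks (those whose body contains no further opening sequence) and repeat until a pass changes nothing. The class and method names are hypothetical, and the lookahead is an addition to the regex given earlier, so that each pass removes only complete innermost blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimitedBlockRemover
{
    // Removes every block between `open` and `close`, including nested ones.
    // Unmatched delimiters are left in the text.
    public static string Strip(string text, string open, string close)
    {
        string o = Regex.Escape(open);
        string c = Regex.Escape(close);

        // An innermost block: open, a body with no further open, then close.
        string innermost = o + "(?:(?!" + o + ").)*?" + c;

        string previous;
        do
        {
            previous = text;
            text = Regex.Replace(text, innermost, "", RegexOptions.Singleline);
        } while (text != previous); // stop once a pass removes nothing

        return text;
    }

    public static void Main()
    {
        Console.WriteLine(Strip("a {| b {| c |} d |} e", "{|", "|}"));
    }
}
```

For the sample input the first pass removes `{| c |}`, the second removes the now flat outer block, and the third changes nothing, leaving `a  e` (two spaces). Tag-style delimiters whose opening sequence varies, like the escaped `&lt;gallery ...&gt;` case, would need a pattern rather than a fixed string for `open`.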


Remove non-HTML special tags from text
