ansaurus

Question

Regex for specific tags and their content, grouped by the tag name

Answer 1

+1 A:

Is the data proper xml, or does it just look like it?

If it is HTML, then the HTML Agility Pack is worth investigating. It provides a DOM (similar to XmlDocument) that you can use to query the data:

string input = @"<html>...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...</html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
// note: SelectNodes returns null (not an empty list) when nothing matches
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//user | //message"))
{
    Console.WriteLine("{0}: {1}", node.Name, node.InnerText);
    // or node.InnerHtml to keep the formatting within the content
}

This outputs:

user:  hello mitch
message:  some html message bla

If you want the formatting tags, then use .InnerHtml instead of .InnerText.

If it is XML, then to cope with the full spectrum of XML it would be better to use an XML parser. For small-to-mid size XML, loading it into a DOM such as XmlDocument would be fine; then query the nodes (for example, "//*"). For huge XML, XmlReader might be an option.

If you don't need to handle the full range of XML, then a simple regex shouldn't be too tricky. A simplified example (no attributes, no namespaces, no nested XML) might be:
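As a rough illustration of the XmlDocument route, here is a minimal sketch. It assumes the input is well-formed XML; the `<root>` wrapper, the sample data, and the `XmlQueryDemo` / `Extract` names are made up for this example and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XmlQueryDemo
{
    // Returns "name: text" for every <user> and <message> element (hypothetical helper).
    public static List<string> Extract(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        List<string> result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//user | //message"))
        {
            // InnerText drops nested markup; use InnerXml to keep it
            result.Add(node.Name + ": " + node.InnerText);
        }
        return result;
    }

    static void Main()
    {
        // Assumed sample input, wrapped in a single root element
        string xml = "<root><user>hello <b>mitch</b></user>"
                   + "<message>some <i>message</i></message></root>";
        foreach (string line in Extract(xml))
            Console.WriteLine(line);
    }
}
```

This only works when the document really is XML; for HTML that is not well-formed, LoadXml will throw, and the HTML Agility Pack approach above is the safer choice.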

string input = @"blah <tag1> content for tag 1 </tag1> blop
<tag2> content for tag 2 </tag2> bloop
<tag3> content for tag 3 </tag3> blip";

const string pattern = @"<(\w+)>\s*([^<>]*)\s*</(\1)>";
Console.WriteLine(Regex.IsMatch(input, pattern));
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("{0}: {1}", match.Groups[1], match.Groups[2]);
}

Marc Gravell 2008-10-14 09:43:28

The data is not valid XML; it is an HTML page.

mitch 2008-10-14 09:48:29

I'll update to mention HTML Agility Pack

Marc Gravell 2008-10-14 09:56:01

This looks very interesting, I will check it out, tnx.

mitch 2008-10-14 10:19:37

Answer 2

A:

Regex for this might be:

/<([^>]+)>([^<]+)<\/\1>/

But it's generic, as I don't know much about the escaping mechanism of .NET. To translate it:

the first group matches the first tag's name, between < and >
the second group matches the contents (from > to the next <)
the end checks that the first tag is closed

HTH

Zsolt Botykai 2008-10-14 09:46:14

I tried it, but it doesn't match anything.

mitch 2008-10-14 09:49:38

Note that, due to the [^<] character class for the tag content, this will fail on nested tags. .*? would be needed if nested tags are to be allowed. (Comment based on PCRE, which may or may not be equivalent to .NET's regex engine.)

Dave Sherohman 2008-10-14 11:36:30

Answer 3

+1 A:

I don't see why you would want to use match group names for that.

Here is a regular expression that would match tag name and tag content into numbered sub-matches:

<(tag1|tag2|tag3)>(.*?)</\1>

Here is a variant with .NET style group names:

<(?'name'tag1|tag2|tag3)>(?'value'.*?)</\k'name'>
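For reference, a minimal sketch of how the named-group variant might be used from C#. It assumes the trailing period after the pattern is sentence punctuation rather than part of the regex, substitutes the question's `user` and `message` for tag1..tag3, and adds RegexOptions.Singleline so that `.` also matches newlines. The class and method names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Pattern from the answer, with the question's tag names substituted in
    const string Pattern = @"<(?'name'user|message)>(?'value'.*?)</\k'name'>";

    // Returns one "name:value" string per match (hypothetical helper)
    public static string[] Extract(string input)
    {
        // Singleline is an assumption: it lets tag content span several lines
        MatchCollection matches = Regex.Matches(input, Pattern, RegexOptions.Singleline);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            result[i] = m.Groups["name"].Value + ":" + m.Groups["value"].Value;
        }
        return result;
    }

    static void Main()
    {
        string input = "x <user> hello <b>mitch</b> </user> y "
                     + "<message> some <i>message</i> </message> z";
        foreach (string line in Extract(input))
            Console.WriteLine(line);
    }
}
```

Because the content group is lazy (`.*?`) and the closing tag is tied to the opening one with `\k'name'`, nested tags such as `<b>` end up inside the captured value instead of breaking the match.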

EDIT

RegEx adapted as per question author's clarification.

Tomalak 2008-10-14 09:46:41

Tomalak, that's GREAT! Works perfectly, exactly what I needed. I tried upvoting you, but I'll have to register. I also tried to accept the answer, but nothing happens.

mitch 2008-10-14 10:21:32

You are welcome anyway. ;-) But you are invited to register and accept the answer, if you want to return the favor.

Tomalak 2008-10-14 11:14:54

I will - promise.

mitch 2008-10-14 16:03:05

Here it is, accepted :)

mitch 2008-10-15 08:59:19

Answer 4

A:

This will give you named capture groups for what you want. It won't work for nested tags, however.

/<(?<name>[^>]+)>(?<value>[^<]+)<\/\1>/

ruquay 2008-10-14 09:48:43

Answer 5

+1 A:

Thanks all, but none of the regexes work. :( Maybe I wasn't specific enough, sorry for that. Here is the exact HTML I'm trying to parse:

...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...

I hope it's clearer now. I'm after USER and MESSAGE tags.

I need to get two matches, each with two groups. The first group would give me the tag name (user or message) and the second group would give me the entire inner text of the tag.

mitch 2008-10-14 09:54:56

I have made some amendments to my answer, please try again!

Tomalak 2008-10-14 10:04:15

That's not HTML... Not with standard DTD anyway.

PhiLho 2008-10-14 11:56:23

Answer 6

+1 A:

The problem was that the ([^<]*) people were using to match things inside the tags stops at the opening < of a nested tag. The regex then expects the outer closing tag at that point, finds the nested tag instead, and so the match fails.

Here is a slightly more robust version of Tomalak's regex allowing for attributes and whitespace:

Regex tagRegex = new Regex(@"<\s*(?<tag>" + string.Join("|", tags) + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>", RegexOptions.IgnoreCase);

Obviously, if you're only ever going to need a specific set of tags you can replace the

string.Join("|", tags)

with the hardcoded pipe-separated list of tags.

One limitation of the regex is that if one of the tags you are trying to match is nested inside another, it will only match the outer tag. For example:

<user>abc<message>def</message>ghi</user>

It will match the outer user tag, but not the inner message tag.

It also doesn't handle >'s quoted in attributes like so:

<user attrib="oops>">

It will just match

<user attrib="oops>

as the tag and the

">

will be a part of the tag's content.
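Both limitations can be reproduced with a short sketch. The tag list (`user`, `message`) is assumed from the question, RegexOptions.Singleline is added so content may span lines, and the `TagRegexLimits` / `Describe` names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagRegexLimits
{
    // Assumed tag list, taken from the question
    static readonly string[] Tags = { "user", "message" };

    static readonly Regex TagRegex = new Regex(
        @"<\s*(?<tag>" + string.Join("|", Tags) + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns "tag => content" for the first match, or "(no match)"
    public static string Describe(string input)
    {
        Match m = TagRegex.Match(input);
        return m.Success
            ? m.Groups["tag"].Value + " => " + m.Groups["content"].Value
            : "(no match)";
    }

    // Number of non-overlapping matches found in the input
    public static int Count(string input)
    {
        return TagRegex.Matches(input).Count;
    }

    static void Main()
    {
        // Nested target tags: only the outer <user> is reported
        Console.WriteLine(Describe("<user>abc<message>def</message>ghi</user>"));
        // A quoted '>' ends the opening tag early, so "> leaks into the content
        Console.WriteLine(Describe("<user attrib=\"oops>\">text</user>"));
    }
}
```

If nested target tags matter, one possible workaround is to run the regex again on each captured content group; for anything more involved, a real parser is likely the better tool.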

ICR 2008-10-14 10:47:36
