Here is the input (html, not xml):

... html content ...
<tag1> content for tag 1 </tag1>
<tag2> content for tag 2 </tag2>
<tag3> content for tag 3 </tag3>
... html content ...

I would like to get 3 matches, each with two groups. First group would contain the name of the tag and the second group would contain the inner text of the tag. There are just those three tags, so it doesn't need to be universal.

In other words:

match.Groups["name"] would be "tag1"
match.Groups["value"] would be "content for tag 1"

Any ideas?

+1  A: 

Is the data proper xml, or does it just look like it?

If it is html, then the HTML Agility Pack is worth investigation - this provides a DOM (similar to XmlDocument) that you can use to query the data:

string input = @"<html>...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...</html>";

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(input);
            foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//user | //message"))
            {
                Console.WriteLine("{0}: {1}", node.Name, node.InnerText);
                // or node.InnerHtml to keep the formatting within the content
            }

This outputs:

user:  hello mitch
message:  some html message bla

If you want the formatting tags, then use .InnerHtml instead of .InnerText.

If it is xml, then to cope with the full spectrum of xml, it would be better to use an xml parser. For small-to-mid size xml, loading it into a DOM such as XmlDocument would be fine - then query the nodes (for example, "//*"). For huge xml, XmlReader might be an option.
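
As a minimal sketch of the XmlDocument route, assuming the input really is well-formed xml with a single root element: the XmlTagReader class, the Read helper, the root element name and the sample string below are all hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XmlTagReader
{
    // Returns (element name, trimmed inner text) for each node matched by the XPath query.
    public static List<KeyValuePair<string, string>> Read(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        var result = new List<KeyValuePair<string, string>>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            result.Add(new KeyValuePair<string, string>(node.Name, node.InnerText.Trim()));
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical sample; real html will usually fail LoadXml.
        string xml = "<root><tag1> content for tag 1 </tag1><tag2> content for tag 2 </tag2></root>";
        foreach (var pair in Read(xml, "/root/*"))
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

This only helps when the data parses as xml; for an html page, LoadXml is likely to throw, which is why the HTML Agility Pack is suggested above.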

If you don't need to handle the full range of xml, then a simple regex shouldn't be too tricky... a simplified example (no attributes, no namespaces, no nested xml) might be:

string input = @"blah <tag1> content for tag 1 </tag1> blop
<tag2> content for tag 2 </tag2> bloop
<tag3> content for tag 3 </tag3> blip";

        const string pattern = @"<(\w+)>\s*([^<>]*)\s*</(\1)>";
        Console.WriteLine(Regex.IsMatch(input, pattern));
        foreach(Match match in Regex.Matches(input, pattern)) {
            Console.WriteLine("{0}: {1}", match.Groups[1], match.Groups[2]);
        }
Marc Gravell
Data is not valid xml, but html page.
mitch
I'll update to mention HTML Agility Pack
Marc Gravell
This looks very interesting, I will check it out, tnx.
mitch
A: 

Regex for this might be:

/<([^>]+)>([^<]+)<\/\1>/

But it's generic, as I don't know much about the escaping mechanism of .NET. To translate it:

  • the first group matches the first tag's name between < and >
  • the second group matches the contents (from > to the next <)
  • the end checks that the first tag is closed

HTH

Zsolt Botykai
I tried it, but it doesn't match anything.
mitch
Note that, due to the [^<] character class for the tag content, this will fail on nested tags. .*? would be needed if nested tags are to be allowed. (Comment based on PCRE, which may or may not be equivalent to .NET's regex engine.)
Dave Sherohman
+1  A: 

I don't see why you would want to use match group names for that.

Here is a regular expression that would match tag name and tag content into numbered sub matches.

<(tag1|tag2|tag3)>(.*?)</\1>

Here is a variant with .NET style group names

<(?'name'tag1|tag2|tag3)>(?'value'.*?)</\k'name'>
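
A rough sketch of how the named-group variant could be used from C#, run against the input from the question. The NamedGroupDemo class and the Extract helper are hypothetical names, and RegexOptions.Singleline is an added assumption so that . also matches newlines when tag content spans several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // \k'name' must repeat whatever the 'name' group captured in the opening tag.
    const string Pattern = @"<(?'name'tag1|tag2|tag3)>(?'value'.*?)</\k'name'>";

    // Returns (tag name, raw inner text) pairs, one per match.
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.Singleline))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    static void Main()
    {
        string input = "... html content ...\n"
            + "<tag1> content for tag 1 </tag1>\n"
            + "<tag2> content for tag 2 </tag2>\n"
            + "<tag3> content for tag 3 </tag3>\n"
            + "... html content ...";

        foreach (var pair in Extract(input))
        {
            // Trim only for display; the captured value keeps its surrounding spaces.
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value.Trim());
        }
    }
}
```

Because .*? is lazy, each match stops at the first closing tag with the same name, so markup nested inside the content ends up in the value group untouched.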

EDIT

RegEx adapted as per question author's clarification.

Tomalak
Tomalak, that's GREAT! Works perfectly, exactly what I needed. I tried upvoting you, but I'll have to register. I also tried to accept the answer, but nothing happens.
mitch
You are welcome anyway. ;-) But you are invited to register and accept the answer, if you want to return the favor.
Tomalak
I will - promise.
mitch
Here it is, accepted :)
mitch
A: 

This will give you named capture groups for what you want. It won't work for nested tags, however.

/<(?<name>[^>]+)>(?<value>[^<]+)<\/\k<name>>/

ruquay
+1  A: 

Thanks all, but none of the regexes work. :( Maybe I wasn't specific enough, sorry for that. Here is the exact html I'm trying to parse:

...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...

I hope it's clearer now. I'm after USER and MESSAGE tags.

I need to get two matches, each with two groups. The first group would give me the tag name (user or message) and the second group would give me the entire inner text of the tag.

mitch
I have made some amendments to my answer, please try again!
Tomalak
That's not HTML... Not with standard DTD anyway.
PhiLho
+1  A: 

The problem was that the ([^<]*) people were using to match things inside the tags stops at the opening < of a nested tag. The regex then expects the outer closing tag at that point, finds the nested tag instead, and so the match fails.

Here is a slightly more robust version of Tomalak's regex allowing for attributes and whitespace:

Regex tagRegex = new Regex(@"<\s*(?<tag>" + string.Join("|", tags) + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>", RegexOptions.IgnoreCase);

Obviously, if you're only ever going to need a specific set of tags, you can replace the

string.Join("|", tags)

with a hardcoded, pipe-separated list of tags.
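
For illustration, here is a sketch of that regex built from a tag list and run over the html from the question. TagRegexDemo and Build are hypothetical names. The Regex.Escape call and RegexOptions.Singleline are additions that are not in the one-liner above: the first guards against metacharacters in tag names, the second is an assumption that tag content may contain newlines.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagRegexDemo
{
    // Builds the tag-matching regex for the given tag names.
    public static Regex Build(params string[] tags)
    {
        // Escape each name so it is matched literally inside the alternation.
        string alternation = string.Join("|",
            Array.ConvertAll<string, string>(tags, Regex.Escape));

        return new Regex(
            @"<\s*(?<tag>" + alternation + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        string input = "...some html content <b> etc </b> ...\n"
            + "<user> hello <b>mitch</b> </user>\n"
            + "...some html content <b> etc </b> ...\n"
            + "<message> some html <i>message</i> <a href....>bla</a> </message>\n"
            + "...some html content <b> etc </b> ...";

        foreach (Match m in Build("user", "message").Matches(input))
        {
            Console.WriteLine("{0}: {1}", m.Groups["tag"].Value, m.Groups["content"].Value.Trim());
        }
    }
}
```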

One limitation of the regex is that if one of the tags you are trying to match is nested inside another, it will only match the outer tag, e.g.

<user>abc<message>def</message>ghi</user>

It will match the outer user tag, but not the inner message tag.

It also doesn't handle a > quoted inside an attribute value, like so:

<user attrib="oops>">

It will just match

<user attrib="oops>

as the tag and the

">

will be a part of the tag's content.
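
Both limitations can be reproduced with a small sketch. LimitationsDemo is a hypothetical name, the tag list is hardcoded to user|message, and RegexOptions.Singleline is again an added assumption.

```csharp
using System;
using System.Text.RegularExpressions;

static class LimitationsDemo
{
    public static readonly Regex TagRegex = new Regex(
        @"<\s*(?<tag>user|message)[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    static void Main()
    {
        // Nested case: the outer <user> match consumes the inner <message>,
        // so the inner tag is never reported as a match of its own.
        string nested = "<user>abc<message>def</message>ghi</user>";
        foreach (Match m in TagRegex.Matches(nested))
        {
            Console.WriteLine("{0}: {1}", m.Groups["tag"].Value, m.Groups["content"].Value);
        }

        // Quoted '>' case: [^>]* stops at the '>' inside the attribute value,
        // so the leftover "> is captured as the start of the content.
        string quoted = "<user attrib=\"oops>\">hi</user>";
        Match q = TagRegex.Match(quoted);
        Console.WriteLine(q.Groups["content"].Value);
    }
}
```

If either case matters for your input, a real parser such as the HTML Agility Pack mentioned in another answer is probably the safer choice.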

ICR