views:

412

answers:

5

How would you use regexes to find the value of a string that is repeated, along with the data between its two occurrences? For example, take this piece of XML:

<tagName>Data between the tag</tagName>

What would be the correct regex to find these values? (Note that tagName could be anything).

I have found a way that works: find all the tagNames that sit between a < and a >, search from each opening tag to the end of the string for the first matching closing </tagName>, and then work out the data between the two. However, this is extremely inefficient and complex. There must be an easier way!

EDIT: Please don't tell me to use XMLReader; I doubt I will ever use my custom class for reading XML, I am trying to learn the best way to do it (and the wrong ways) through attempting to make my own.

Thanks in advance.

A: 

with Perl:

my $tagName = 'some tag';
my $i; # some line of XML
$i =~ /\<$tagName\>(.+)\<\/$tagName\>/;

After a successful match, $1 holds the data you captured.

dls
I knew this already from Perl, and the question is about C#.
Callum Rogers
sorry C Rogers - I failed to read all the tags!
dls
+5  A: 

You can use: <(\w+)>(.*?)<\/\1>

Group #1 is the tag, Group #2 is the content.
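
For the C# side of the question, here is a minimal sketch of how that pattern could be applied with System.Text.RegularExpressions. The class name TagMatchDemo and the helper FirstElement are made up for illustration, and the sketch assumes tags without attributes, exactly as the pattern does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagMatchDemo
{
    // Hypothetical helper: returns the tag name and content of the first
    // <tag>...</tag> pair, or null when the input contains no such pair.
    public static (string Tag, string Content)? FirstElement(string input)
    {
        // \1 must repeat whatever group 1 captured, so the closing tag has to
        // carry the same name as the opening tag.
        // RegexOptions.Singleline lets '.' cross line breaks inside the content.
        Match m = Regex.Match(input, @"<(\w+)>(.*?)</\1>", RegexOptions.Singleline);
        if (!m.Success) return null;
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    public static void Main()
    {
        var result = FirstElement("<tagName>Data between the tag</tagName>");
        if (result.HasValue)
        {
            // prints "tagName: Data between the tag"
            Console.WriteLine(result.Value.Tag + ": " + result.Value.Content);
        }
    }
}
```

Regex.Matches with the same pattern would return every non-overlapping pair instead of only the first one.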

Michael Morton
Thanks, this is really useful.
Callum Rogers
+2  A: 

You can use a backreference like \1 to refer to an earlier match:

@"<([^>]*)>(.*)</\1>"

The \1 will match what was captured by the first parenthesized group.
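
One thing to watch with this pattern is that (.*) is greedy, so when the same tag appears twice in a row it runs on to the last closing tag. The sketch below compares it with a lazy (.*?) variant; GreedyLazyDemo and Content are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Hypothetical helper: returns capture group 2 (the content) of the
    // first match, or null when the pattern does not match at all.
    public static string Content(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        string input = "<a>one</a><a>two</a>";

        // Greedy: .* extends to the LAST </a> in the input.
        Console.WriteLine(Content(@"<([^>]*)>(.*)</\1>", input));  // one</a><a>two

        // Lazy: .*? stops at the FIRST </a> that follows the opening tag.
        Console.WriteLine(Content(@"<([^>]*)>(.*?)</\1>", input)); // one
    }
}
```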

John Kugelman
+3  A: 

Using regular expressions to parse XML is a terrible error.

This is efficient (it doesn't parse the XML into a DOM) and simple enough:

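using System;
using System.IO;
using System.Xml;

string s = "<tagName>Data between the tag</tagName>";

using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
    xr.Read(); // advance to the <tagName> start tag
    Console.WriteLine(xr.ReadElementContentAsString()); // read the element's text content
}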

Edit:

Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:

Consider this fairly trivial test case:

<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>

There are two elements with a tag name of "a" in that XML. The inner one has one text-node child with a value of "text1", and the outer one has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't, because it's enclosed in a CDATA section.

You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
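
The stack idea described above can be sketched roughly as follows. This is only an illustration under narrow assumptions: it recognizes bare start tags, end tags, and CDATA sections, and nothing else. StackScanDemo, TextByElement, and the "name=text" output format are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class StackScanDemo
{
    // Hypothetical helper: returns one "tagName=text" entry per text run,
    // attributed to the innermost element that is open at that point.
    public static List<string> TextByElement(string xml)
    {
        const string CdataStart = "<![CDATA[";
        const string CdataEnd = "]]>";
        var open = new Stack<string>();   // names of the currently open elements
        var result = new List<string>();
        int i = 0;

        while (i < xml.Length)
        {
            if (string.CompareOrdinal(xml, i, CdataStart, 0, CdataStart.Length) == 0)
            {
                // Inside CDATA nothing is markup: take the text verbatim, push nothing.
                int end = xml.IndexOf(CdataEnd, i, StringComparison.Ordinal);
                if (end < 0) throw new FormatException("Unterminated CDATA section");
                int start = i + CdataStart.Length;
                Add(result, open, xml.Substring(start, end - start));
                i = end + CdataEnd.Length;
            }
            else if (xml[i] == '<')
            {
                int close = xml.IndexOf('>', i);
                if (close < 0) throw new FormatException("Unterminated tag");
                string name = xml.Substring(i + 1, close - i - 1);
                if (name.Length > 0 && name[0] == '/')
                {
                    // End tag: it must match the most recently opened element.
                    if (open.Count == 0 || open.Pop() != name.Substring(1))
                        throw new FormatException("Mismatched end tag: " + name);
                }
                else
                {
                    open.Push(name); // start tag
                }
                i = close + 1;
            }
            else
            {
                // Plain text up to the next '<' (or the end of the input).
                int next = xml.IndexOf('<', i);
                if (next < 0) next = xml.Length;
                Add(result, open, xml.Substring(i, next - i));
                i = next;
            }
        }

        if (open.Count != 0) throw new FormatException("Unclosed element: " + open.Peek());
        return result;
    }

    private static void Add(List<string> result, Stack<string> open, string text)
    {
        if (text.Length > 0 && open.Count > 0) result.Add(open.Peek() + "=" + text);
    }

    public static void Main()
    {
        string xml = "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>";
        foreach (string entry in TextByElement(xml))
            Console.WriteLine(entry);
        // prints:
        // a=text1
        // b=<a>text2</a>
        // a=text3
    }
}
```

Even this toy version needs an explicit stack and a special case for CDATA, which a single pattern with a backreference cannot express.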

And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.

Robert Rossney
I am trying to make my own "XMLReader". It will not be fast/efficient/usable or ever used, but I think that people should try to build things from the ground up rather than resorting to APIs all the time, so they at least know the ideas behind it and why the code they created was so bad. Are you really a computer scientist if you cannot do fast multiplication or even reverse a string without using .NET/Java/whatever's built-in library? Perhaps not. You may be right about the regexes though. Even so, I will try, then fail, then learn.
Callum Rogers
I don't think you should mark someone down for pointing out the best way to achieve something, just because you deliberately want to do it the difficult way.
Dan Diplo
Granted, I just felt I had to explain my actions in choosing the difficult/fail route.
Callum Rogers
Thank you, this edit is very useful.
Callum Rogers
A: 

Going forward, if you get stuck, check out regexlib.com.

It's the first place I go when I get stuck on a regex.

AcousticBoom