ansaurus

Question

How to find a repeated string and the value between them using regexes?

Answer 1

A:

With Perl:

my $tagName = 'some tag';
my $i; # some line of XML
# the same variable must appear in both the start and the end tag
$i =~ /\<$tagName\>(.+)\<\/$tagName\>/;

After a successful match, $1 holds the data you captured.

dls 2009-07-22 19:20:28

I already knew this from Perl, but the question is about C#.

Callum Rogers 2009-07-22 19:29:15

Sorry, C Rogers. I failed to read all the tags!

dls 2009-07-23 01:13:47

Answer 2

+5 A:

You can use: <(\w+)>(.*?)<\/\1>

Group #1 is the tag, Group #2 is the content.
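For the C# side of the question, here is a minimal sketch of how that pattern might be used with System.Text.RegularExpressions. The class and method names (TagPairDemo, FirstPair) are made up for illustration. It assumes simple, attribute-free tags on a single line, since "." does not match a newline unless RegexOptions.Singleline is set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPairDemo
{
    // The pattern from this answer: group 1 is the tag name, group 2 is the content.
    // \1 is a backreference that must repeat whatever group 1 captured.
    private static readonly Regex TagPair = new Regex(@"<(\w+)>(.*?)<\/\1>");

    // Hypothetical helper: returns { tag, content } for the first match,
    // or null when the input contains no matching pair of tags.
    public static string[] FirstPair(string input)
    {
        Match m = TagPair.Match(input);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    public static void Main()
    {
        string[] pair = FirstPair("<tagName>Data between the tag</tagName>");
        Console.WriteLine(pair[0]); // tagName
        Console.WriteLine(pair[1]); // Data between the tag
    }
}
```

Regex.Matches with the same pattern should give you every non-overlapping pair if you need more than the first one.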

Michael Morton 2009-07-22 19:22:42

Thanks, this is really useful.

Callum Rogers 2009-07-22 19:39:18

Answer 3

+2 A:

You can use a backreference like \1 to refer to an earlier match:

@"<([^>]*)>(.*)</\1>"

The \1 will match what was captured by the first parenthesized group.
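One thing to watch with this pattern is that (.*) is greedy, so when the same tag appears more than once on a line it runs to the last matching end tag. A small sketch comparing it with the lazy form (.*?); the class and method names here are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Hypothetical helper: returns the content group (group 2) of the
    // first match, or null if the pattern does not match at all.
    public static string Content(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        string input = "<a>one</a><a>two</a>";

        // Greedy: .* grabs as much as it can before the final </a>.
        Console.WriteLine(Content(@"<([^>]*)>(.*)</\1>", input));  // one</a><a>two

        // Lazy: .*? stops at the first </a> that completes the match.
        Console.WriteLine(Content(@"<([^>]*)>(.*?)</\1>", input)); // one
    }
}
```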

John Kugelman 2009-07-22 19:23:03

Answer 4

+3 A:

Using regular expressions to parse XML is a terrible error.

This is efficient (it doesn't parse the XML into a DOM) and simple enough:

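using System;
using System.IO;
using System.Xml;

string s = "<tagName>Data between the tag</tagName>";

// XmlReader streams through the input without building a DOM.
using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
    xr.Read(); // advance to the <tagName> start element
    Console.WriteLine(xr.ReadElementContentAsString());
}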

Edit:

Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:

Consider this fairly trivial test case:

<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>

There are two elements with a tag name of "a" in that XML. The inner one has one text-node child with a value of "text1", and the outer one has one text-node child with a value of "text3". Also, there's a "b" element whose content looks like an "a" element but isn't one, because it's enclosed in a CDATA section.

You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
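To make the stack idea concrete, here is a rough sketch of such a scanner. It is not a real XML parser: TinyTagScanner and DirectText are names invented for this example, and it assumes bare start and end tags plus CDATA sections only, with no attributes, comments, empty elements, or processing instructions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TinyTagScanner
{
    private const string CDataStart = "<![CDATA[";
    private const string CDataEnd = "]]>";

    // Returns the direct text of every element named tagName,
    // in the order in which those elements are closed.
    public static List<string> DirectText(string xml, string tagName)
    {
        var results = new List<string>();
        var names = new Stack<string>();        // open elements, innermost on top
        var texts = new Stack<StringBuilder>(); // text collected per open element
        int i = 0;

        while (i < xml.Length)
        {
            if (string.CompareOrdinal(xml, i, CDataStart, 0, CDataStart.Length) == 0)
            {
                // Inside CDATA nothing is a tag: copy the text, leave the stack alone.
                int end = xml.IndexOf(CDataEnd, i, StringComparison.Ordinal);
                if (end < 0) throw new FormatException("Unterminated CDATA section.");
                int start = i + CDataStart.Length;
                if (texts.Count > 0) texts.Peek().Append(xml, start, end - start);
                i = end + CDataEnd.Length;
            }
            else if (xml[i] == '<')
            {
                int close = xml.IndexOf('>', i);
                if (close < 0) throw new FormatException("Unterminated tag.");
                string tag = xml.Substring(i + 1, close - i - 1);

                if (tag.Length > 0 && tag[0] == '/')
                {
                    // End tag: it must match the most recently opened element.
                    string name = tag.Substring(1);
                    if (names.Count == 0 || names.Peek() != name)
                        throw new FormatException("Unexpected end tag </" + name + ">.");
                    names.Pop();
                    string text = texts.Pop().ToString();
                    if (name == tagName) results.Add(text);
                }
                else
                {
                    // Start tag: push it and begin collecting its text.
                    names.Push(tag);
                    texts.Push(new StringBuilder());
                }
                i = close + 1;
            }
            else
            {
                // Plain character data belongs to the innermost open element.
                if (texts.Count > 0) texts.Peek().Append(xml[i]);
                i++;
            }
        }

        if (names.Count > 0)
            throw new FormatException("Unclosed element <" + names.Peek() + ">.");
        return results;
    }

    public static void Main()
    {
        string xml = "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>";
        Console.WriteLine(string.Join(", ", DirectText(xml, "a"))); // text1, text3
    }
}
```

The inner "a" closes first, so its text comes out first. The "<a>text2</a>" inside the CDATA section never touches the stack and ends up as the text of the inner "b" element.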

And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.

Robert Rossney 2009-07-22 19:33:41

I am trying to make my own "XMLReader". It will not be fast, efficient, or usable, and it will never be used, but I think people should try to build things from the ground up rather than resorting to APIs all the time, so they at least know the ideas behind them and why the code they wrote was so bad. Are you really a computer scientist if you cannot do fast multiplication or even reverse a string without using .NET/Java/whatever's built-in library? Perhaps not. You may be right about the regexes though. Even so, I will try, then fail, then learn.

Callum Rogers 2009-07-22 19:35:32

I don't think you should mark someone down for pointing out the best way to achieve something, just because you deliberately want to do it the difficult way.

Dan Diplo 2009-07-22 19:54:44

Granted, I just felt I had to explain my actions in choosing the difficult/fail route.

Callum Rogers 2009-07-22 20:05:42

Thank you, this edit is very useful.

Callum Rogers 2009-07-22 20:41:47

Answer 5

A:

Going forward, if you get stuck, check out regexlib.com.

It's the first place I go when I get stuck on a regex.

AcousticBoom 2009-07-22 19:43:43
