ansaurus

Question

Regular expression to get all the value between custom TAG

Answer 1

+3 A:

You can start with a very simple regex:

<myflash[^>]*>(.*?)</myflash>

Just make sure to use the "non-greedy" capture (.*?), so that the ".*" matches as little as possible.

Also, use RegexOptions.Singleline, so that the dot matches every character, including \n:

Regex re = new Regex("<myflash[^>]*>(.*?)</myflash>", RegexOptions.Singleline);
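
A minimal sketch of how the pattern above might be applied to collect every captured value. The class name MyFlashRegex and the method InnerContents are hypothetical names chosen for illustration, and adding IgnoreCase is an assumption that tag names may vary in case:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class MyFlashRegex
{
    // Singleline lets '.' match '\n'; IgnoreCase (an assumption) also accepts <MyFlash>.
    private static readonly Regex Pattern = new Regex(
        "<myflash[^>]*>(.*?)</myflash>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text captured by group 1 for every match in the input.
    public static List<string> InnerContents(string input)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Group 1 holds only the text between the tags; m.Value would return the whole element, including the opening and closing tags.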

Ferdinand Beyer 2009-04-29 09:26:37

This expression is not working, possibly because it has <param></param> tags inside it.

2009-04-29 10:10:46

Use RegexOptions.Multiline

majkinetor 2009-04-29 10:59:33

The PARAM tags shouldn't matter. Did you use the Singleline flag? You might want to use IgnoreCase too, if your tags don't always use lowercase names. If that doesn't work, we would need to see your code, because the regex does exactly what you asked for.

Alan Moore 2009-04-29 11:00:49

@majkinetor, the Multiline flag won't change anything. It allows ^ and $ to match the beginning and end, respectively, of logical lines as well as the beginning and end of the whole string.

Alan Moore 2009-04-29 11:04:10
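
The difference between the two flags discussed in these comments can be checked with a short sketch. The class FlagDemo and its method names are hypothetical and used only for illustration; the pattern is the one from the answer:

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not taken from the thread.
public static class FlagDemo
{
    private const string Pattern = "<myflash[^>]*>(.*?)</myflash>";

    // No options: '.' does not match '\n', so content spanning lines fails to match.
    public static bool MatchesWithNone(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.None);
    }

    // Multiline only changes how ^ and $ behave; '.' still stops at '\n'.
    public static bool MatchesWithMultiline(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.Multiline);
    }

    // Singleline makes '.' match every character, including '\n'.
    public static bool MatchesWithSingleline(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.Singleline);
    }
}
```

For an input such as "<myflash>line one\nline two</myflash>", only the Singleline variant should report a match.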

Yeah... the point was actually to see whether the dot operator consumes newlines. I don't know why I connected that with Multiline :)

majkinetor 2009-04-29 11:58:52

Note that the `>` is allowed in attribute values.

Gumbo 2009-04-29 13:21:43

The single-/multiline options are not just badly named, they shouldn't exist at all. They're a Perl-historical artifact, and in Perl 6 they've finally been done away with. Who knows how long the rest of us will be stuck with them. :-/

Alan Moore 2009-04-30 02:41:22

@Gumbo: No it isn't -- it must be encoded as an entity (&gt;, although browsers will tolerate it).

Ferdinand Beyer 2009-04-30 11:06:08

Answer 2

+3 A:

Regex is, IMO, the wrong tool for processing XML. Why not use XmlDocument or XDocument etc? If that is HTML (note no "X"), then the HTML Agility Pack may be useful.

With both XmlDocument and the HTML Agility Pack you can use XPath, so you can simply use .SelectNodes("//myflash"). XDocument has a similar but different method: .Descendants("myflash").
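
A rough sketch of the two XML routes mentioned above, assuming the input is well-formed XML and that the element is named myflash in lowercase. The class MyFlashXml and its method names are hypothetical; the HTML Agility Pack route is omitted because it needs a third-party package:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answer.
public static class MyFlashXml
{
    // XmlDocument route: XPath query, then InnerXml for the markup between the tags.
    public static List<string> ViaXmlDocument(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var results = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//myflash"))
        {
            results.Add(node.InnerXml);
        }
        return results;
    }

    // XDocument route (LINQ to XML): Value yields the concatenated text content only.
    public static List<string> ViaXDocument(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("myflash")
            .Select(e => e.Value)
            .ToList();
    }
}
```

XML names are case-sensitive, so neither query finds an element written as <myFlash> or <MyFlash>.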

Marc Gravell 2009-04-29 09:36:09

+1 No regex for markup!

Andrew Hare 2009-04-29 10:13:55

-1 That isn't the answer ... You provide the answer then eventual notes. Notes without answers are no good.

majkinetor 2009-04-29 11:00:20

@majkinetor - how does .SelectNodes("//myflash") not answer it? It is the work of 2 seconds to discover .InnerXml and .OuterXml, for example. The reason I didn't include this is because the route is different for each of the 3 options, and that choice depends on a: xml vs html (not specified in the question), and b: XmlDocument vs XDocument (which depends on the .NET version, not specified in the question). So go on then: how would you unambiguously answer it?

Marc Gravell 2009-04-29 11:51:19

It's not, because the man asked for RE, not XPath. Instead of speculating about the methods he uses (your advice is sound, that's not the problem), it's better to answer the real question, then offer an alternative (or semantically better) method.

majkinetor 2009-04-29 12:00:42

@majkinetor - right, and if somebody asks for a hammer to put some screws in, do you hand them a hammer? Or do you tell them about screwdrivers?

Marc Gravell 2009-04-29 12:21:54

I give them a hammer and tell them about screwdrivers :P

majkinetor 2009-04-29 16:07:00

Answer 3

A:

As Marc Gravell says, regexes are not suited to parsing HTML (or XML). See Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why. You are much better off using an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples of how to use parsers in many languages (there are at least two examples using C#).

Chas. Owens 2009-04-29 14:36:38
