ansaurus

Question

Regex for a string

Answer 1

+1 A:

You need to use a real parser. Regular expressions can't handle things like arbitrarily nested tags.
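
For illustration only, here is a minimal sketch of the parser approach using Python's standard-library `html.parser`. The question does not name a language, so Python is an assumption, and the class and function names (`DivTextCollector`, `div_texts`) are made up for this example.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects text that appears inside any <div>, at any nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current <div> nesting depth
        self.texts = []  # non-blank text chunks seen while inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags that have no matching opener.
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


def div_texts(html):
    """Return the text chunks found inside <div> elements of `html`."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Because the parser tracks nesting depth with a counter, it copes with nested divs and with extra closing tags, which a single regex cannot do in general.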

Dusty 2010-10-09 15:19:03

This is stuff that belongs in a comment.

NullUserException 2010-10-09 15:20:40

Answer 2

+3 A:

Try this:

<div>([^<]+)(?:<\/div>)*<br>

As seen on rubular

Notes:

This only works if there are no tags in the abc part (or anything else that contains a < symbol).
You might want to use start- and end-of-string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +.
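
A small sketch of how these notes play out, assuming Python's `re` module (the question does not say which regex flavor is in use). The helper name `extract` and its `exact` flag are hypothetical; `fullmatch` is used here as a rough stand-in for wrapping the pattern in ^ and $.

```python
import re

# Pattern from the answer; "/" needs no escaping in Python.
PATTERN = re.compile(r"<div>([^<]+)(?:</div>)*<br>")
# Variant that allows an empty "abc" part (* instead of +).
PATTERN_EMPTY_OK = re.compile(r"<div>([^<]*)(?:</div>)*<br>")


def extract(pattern, s, exact=False):
    """Return the captured text, or None when the pattern does not match.

    With exact=True the whole string must match, roughly like adding
    ^ and $ anchors around the pattern.
    """
    m = pattern.fullmatch(s) if exact else pattern.search(s)
    return m.group(1) if m else None


print(extract(PATTERN, "<div>abc</div></div><br>"))             # abc
print(extract(PATTERN, "<div>a<b>c</div><br>"))                 # None
print(extract(PATTERN_EMPTY_OK, "<div></div><br>") == "")       # True
print(extract(PATTERN, "x<div>abc</div><br>y", exact=True))     # None
```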

That being said, you should be wary of using regex to parse HTML.

In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (i.e., an [X]HTML parser such as HtmlAgilityPack) is preferred when it comes to parsing HTML.

NullUserException 2010-10-09 15:23:32

Note that this won't work if the 'abc' contains tags itself.

Dusty 2010-10-09 15:27:31

@Dusty Or anything with a `<` for that matter.

NullUserException 2010-10-09 15:29:03

@NUE - right. (sorry if it came out as dickish, I just wanted to point it out in case the op meant for abc to include more than plain text)

Dusty 2010-10-09 15:37:21

@Dusty Not at all. That actually should've been included in the post.

NullUserException 2010-10-09 15:38:45

Answer 3

A:

NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.

Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.

The rest of this post refers to the following section of the regex:

([^<]+?)

Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.

Do you want to match if there is nothing inside the div? If so change the + in the above to *

Finally, although it will work fine, you don't need the ? in the above.
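
To make the variations concrete, here is a short sketch assuming Python's `re` module (an assumption; the thread does not fix a flavor). The variable names are only illustrative. The lazy `?` changes nothing because whatever follows the captured part has to start with `<`, so the quantifier must consume every non-`<` character either way.

```python
import re

s = "<div>abc</div><br>"

# Greedy vs. lazy quantifier: both capture the same text.
greedy = re.search(r"<div>([^<]+)(?:</div>)*<br>", s)
lazy = re.search(r"<div>([^<]+?)(?:</div>)*<br>", s)
same_capture = greedy.group(1) == lazy.group(1)

# Without the parentheses you only learn whether the form matches.
matches_form = re.search(r"<div>[^<]+(?:</div>)*<br>", s) is not None

# With ^, nothing may precede the opening tag; without it, a prefix is fine.
anchored = re.search(r"^<div>([^<]+)(?:</div>)*<br>", "junk" + s)
unanchored = re.search(r"<div>([^<]+)(?:</div>)*<br>", "junk" + s)

print(same_capture, matches_form, anchored is None, unanchored.group(1))
```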

Sid_M 2010-10-09 15:40:16

Answer 4

A:

I think this regex is more flexible:

  <div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>

I don't include ^ and $ at the beginning and end of my regex because we cannot be sure that your sample will always be on a single line.

Vantomex 2010-10-09 15:43:43

Did you even test this?

NullUserException 2010-10-09 15:47:50

Sure. If I hadn't, I would have put it in a comment instead of an answer.

Vantomex 2010-10-09 16:01:22

If it doesn't work, maybe your PCRE library doesn't support possessive quantifiers. If so, try removing every plus sign (+) that follows an asterisk in the regex. If that doesn't work either, remove the atomic grouping I made, that is, remove the `(?>` and its matching closing parenthesis. Good luck!
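
As a rough sketch of that fallback, this is the pattern with the possessive plus signs and the atomic group stripped out, tried with Python's `re` module (Python is an assumption here; its `re` module accepts possessive quantifiers and atomic groups only from version 3.11 on). The names `SIMPLIFIED` and `matches` are hypothetical.

```python
import re

# The answer's pattern with every "+" after "*" removed and the
# atomic group "(?>...)" unwrapped into plain ".*?</div>".
SIMPLIFIED = re.compile(
    r"<div\b[^><]*>.*?</div>(?:\s*</div>)*\s*<br(?:\s*/)?>"
)


def matches(s):
    """Return True if the simplified pattern is found anywhere in `s`."""
    return SIMPLIFIED.search(s) is not None


print(matches('<div class="x">abc</div></div>\n<br />'))  # True
print(matches("<div>abc<br>"))                            # False
```

Note that `.` does not match a newline by default, so div contents that span several lines would need the `re.DOTALL` flag in this flavor.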

Vantomex 2010-10-09 16:06:49

Answer 5

A:

You could also include a named group in the expression, e.g.:

<div>(?<text>[^<]*)(?:<\/div>)*<br>

Implemented in C#:

using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
// Groups["text"] is never null in .NET, so checking the match's Success flag is enough.
Func<Match, string> getGroupText = m => m.Success ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));

Console.WriteLine(getText("<div>abc</div><br>")); // abc
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>")); // 123
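
For comparison, a sketch of the same named-group idea in Python, in case that is the language in use (an assumption; the question does not say). Python spells the group `(?P<text>...)` rather than `(?<text>...)`, and the names `NAMED` and `get_text` are illustrative only.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
NAMED = re.compile(r"<div>(?P<text>[^<]*)(?:</div>)*<br>")


def get_text(s):
    """Return the contents of the "text" group, or None if nothing matches."""
    m = NAMED.search(s)
    return m.group("text") if m else None


print(get_text("<div>abc</div><br>"))                          # abc
print(get_text("<div>123</div></div></div></div></div><br>"))  # 123
```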

leakyboat 2010-10-09 16:25:05
