ansaurus

Question

Regex - Find Content of div by id with nested divs

Answer 1

+5 A:

Are you asking for a regular expression that can keep track of the number of DIV tags nested inside a DIV tag? I'm afraid that isn't possible with regular expressions.

You could use a regular expression to find the index of the first DIV tag, then loop over the characters of the string from that index onward while keeping a count of the nested div tags that are currently open: add one for each opening div tag and subtract one for each closing div tag. When you encounter a closing div tag while the count is zero, it is the one that matches the first DIV tag, and you have the starting and ending indices of the substring you want.
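The counting approach described above can be sketched in C# roughly as follows. This is only an illustration, not the answerer's code: the names DivScanner and TryGetInnerHtml are made up for the example, it steps from one div tag to the next with a second regex rather than inspecting every character, and it assumes the id attribute is double-quoted and that no div tags appear inside comments or scripts.

```csharp
using System;
using System.Text.RegularExpressions;

static class DivScanner
{
    // Finds the start of an opening <div ...> tag or a complete closing </div> tag.
    static readonly Regex DivTag = new Regex(@"<div\b|</div\s*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: extracts the text between <div id="..."> and its matching </div>.
    public static bool TryGetInnerHtml(string html, string id, out string inner)
    {
        inner = string.Empty;

        // Step 1: a regex locates the opening tag that carries the requested id.
        Match open = Regex.Match(html,
            @"<div\b[^>]*\sid\s*=\s*""" + Regex.Escape(id) + @"""[^>]*>",
            RegexOptions.IgnoreCase);
        if (!open.Success) return false;

        int contentStart = open.Index + open.Length;
        int depth = 0; // nested divs currently open inside the target div

        // Step 2: visit every later div tag and keep the nesting count up to date.
        for (Match tag = DivTag.Match(html, contentStart); tag.Success; tag = tag.NextMatch())
        {
            bool isClosingTag = tag.Value[1] == '/';
            if (!isClosingTag)
            {
                depth++;
            }
            else if (depth == 0)
            {
                // A closing tag at depth zero is the one that matches the target div.
                inner = html.Substring(contentStart, tag.Index - contentStart);
                return true;
            }
            else
            {
                depth--;
            }
        }
        return false; // ran out of input before the target div was closed
    }

    static void Main()
    {
        string html = "<div id=\"firstdiv\">beginning content<div id=\"content\">some other stuff"
            + "<div id=\"otherdiv\">other stuff here</div> more stuff </div></div>";
        if (TryGetInnerHtml(html, "content", out string inner))
            Console.WriteLine(inner);
        else
            Console.WriteLine("No balanced div with that id was found.");
    }
}
```

The regex is only used to find tags; the nesting itself is tracked by the plain integer counter, which is the part a classical regular expression cannot do.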

Cybis 2008-11-13 02:46:51

I understand that there are recursive extensions that would allow this, but it cannot be done in pure regex.

Ben Doom 2008-11-13 15:18:55

Answer 2

A:

What programming language? If it's .NET and you're sure the HTML is well-formed, you can load it into an XmlDocument or XDocument object and run an XPath query on it.
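A minimal sketch of the XDocument route, assuming the markup really is well-formed XML with a single root element and no XML namespace. The class and method names (XPathDivDemo, TryGetInnerXml) are invented for the example, and the id is pasted into the XPath expression directly, so it must not contain a single quote.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class XPathDivDemo
{
    // Hypothetical helper: returns the inner markup of the div with the given id.
    public static bool TryGetInnerXml(string wellFormedHtml, string id, out string inner)
    {
        inner = string.Empty;

        // Throws XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(wellFormedHtml, LoadOptions.PreserveWhitespace);

        // The XPath query selects the first div, at any depth, with a matching id attribute.
        var div = doc.XPathSelectElement("//div[@id='" + id + "']");
        if (div == null) return false;

        // Serialize the child nodes without reformatting to get the inner content.
        inner = string.Concat(div.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
        return true;
    }

    static void Main()
    {
        string html = "<div id=\"firstdiv\">beginning content<div id=\"content\">some other stuff"
            + "<div id=\"otherdiv\">other stuff here</div> more stuff </div></div>";
        if (TryGetInnerXml(html, "content", out string inner))
            Console.WriteLine(inner);
        else
            Console.WriteLine("No div with that id was found.");
    }
}
```

The parser handles nesting itself, so no depth counting is needed. Real-world HTML that is not well-formed would need an HTML parser instead of an XML one.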

Joel Coehoorn 2008-11-13 02:52:36

...and it would probably parse faster than that regular expression.

Bill the Lizard 2008-11-13 03:31:16

Answer 3

+2 A:

Cybis speaks the truth. This sort of problem falls under context-free languages, which are more powerful than regular languages (the kind covered by regular expressions). There's a lot of computer science theory involved, but suffice it to say that any language worth its salt will have a library written for this sort of thing, and you should probably be using it.

Dan Fego 2008-11-13 02:53:06

Answer 4

+4 A:

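In .NET you can do this with balancing groups:

(?<text>
  (<div\s+id=(\"|&quot;|&\#34;)content(\"|&quot;|&\#34;)[^>]*>)
  (?>
      <div\b (?<depth>)
    |
      </div> (?<-depth>)
    |
      (?! <div\b | </div> ) .
  )*
  (?(depth)(?!))
  </div>
)

Each nested <div pushes a capture onto the depth stack and each nested </div> pops one. The pop fails when the stack is empty, so the loop stops at the </div> that closes the content div, and (?(depth)(?!)) rejects any match that still has unclosed nested divs.

You must use the Singleline option, and also IgnorePatternWhitespace if you keep the pattern laid out on several lines as above. Here is an example using the console:

using System;
using System.Text.RegularExpressions;

namespace Temp
{
    class Program
    {
        static void Main()
        {
            string s = @"
<div id=""firstdiv"">beginning content<div id=""content"">some other stuff
  <div id=""otherdiv"">other stuff here</div>
  more stuff
  </div>
</div>";
            // Same pattern as above, written without the layout whitespace,
            // so only the Singleline option is needed here.
            Regex r = new Regex(@"(?<text>(<div\s+id=(\""|&quot;|&\#34;)"
                + @"content(\""|&quot;|&\#34;)[^>]*>)(?><div\b(?<depth>)|"
                + @"</div>(?<-depth>)|(?!<div\b|</div>).)*(?(depth)(?!))</div>)",
                RegexOptions.Singleline);
            Console.WriteLine("HTML:\n");
            Console.WriteLine(s);
            Match m = r.Match(s);
            if (m.Success)
            {
                Console.WriteLine("\nCaptured text:\n");
                // Look the group up by name rather than by its number.
                Console.WriteLine(m.Groups["text"].Value);
            }
            Console.ReadLine();
        }
    }
}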

pro3carp3 2008-11-13 18:20:17

Leave it to Microsoft to change the definition of regular languages.

Cybis 2008-11-13 21:17:50
