Before anybody asks, I am not doing any kind of screen scraping.

I'm trying to parse an HTML string to find a div with a certain id, and I cannot for the life of me get this to work. The following expression worked in one instance but not in another, and I'm not sure whether extra elements in the HTML are the cause.

<div\s*?id=(\""|&quot;|&#34;)content(\""|&quot;|&#34;).*?>\s*?(?>(?! <div\s*?> | </div> ) | <div\s*?>(?<DEPTH>) | </div>(?<-DEPTH>) | .?)*(?(DEPTH)(?!))</div>

It finds the first div with the right id correctly, but it then stops at the first closing div instead of the closing div that matches it.

<div id="firstdiv">begining content<div id="content">some other stuff
    <div id="otherdiv">other stuff here</div>
    more stuff
    </div>
</div>

This should bring back

<div id="content">some other stuff
   <div id="otherdiv">other stuff here</div>
   more stuff
</div>

Instead, for some reason, it brings back:

   <div id="content">some other stuff
      <div id="otherdiv">other stuff here</div>

Does anybody have an easier expression to handle this?

To clarify, this is in .NET, and DEPTH is a named group that I'm using with .NET's balancing group constructs. You can find more details here.

+5  A: 

Are you asking for a regular expression that can keep track of the number of DIV tags nested inside a DIV tag? I'm afraid that isn't possible with regular expressions.

You could use a regular expression to get the index of the first DIV tag, then loop over the characters in the string, starting at that index, keeping a count of the number of open div tags: add one for each opening tag and subtract one for each closing tag. When you encounter a closing div tag and the count is zero, you have the starting and ending indices of the substring you want.
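
A minimal sketch of that counting approach, assuming C# and an id attribute written with double quotes. The class and method names (DivExtractor, ExtractDivById) are made up for illustration, and instead of walking character by character it uses a small regex to jump from one div tag to the next, which keeps the same count:

```csharp
using System;
using System.Text.RegularExpressions;

static class DivExtractor
{
    // Hypothetical helper: returns the div with the given id, nested divs
    // included, or null if the div is missing or never closed.
    public static string ExtractDivById(string html, string id)
    {
        // Find the opening tag of the target div (assumes id="..." with double quotes).
        Match open = Regex.Match(html,
            "<div\\b[^>]*\\bid=\"" + Regex.Escape(id) + "\"[^>]*>",
            RegexOptions.IgnoreCase);
        if (!open.Success) return null;

        // Matches any opening or closing div tag; group 1 is "/" for a closing tag.
        Regex tag = new Regex(@"<(/?)div\b[^>]*>", RegexOptions.IgnoreCase);
        int depth = 0; // number of nested divs currently open inside the target

        for (Match m = tag.Match(html, open.Index + open.Length); m.Success; m = m.NextMatch())
        {
            if (m.Groups[1].Length == 0)
            {
                depth++;                       // nested opening div
            }
            else if (depth == 0)
            {
                // Closing tag with nothing nested open: this closes the target div.
                return html.Substring(open.Index, m.Index + m.Length - open.Index);
            }
            else
            {
                depth--;                       // closing tag of a nested div
            }
        }
        return null; // ran out of tags before the target div was closed
    }
}

// Usage: string content = DivExtractor.ExtractDivById(html, "content");
```

This does not try to handle comments, scripts, or attribute values that contain a ">" character, so treat it as a starting point rather than a parser.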

Cybis
I understand that there are recursive extensions that would allow this, but it cannot be done in pure regex.
Ben Doom
A: 

What programming language? If it's .NET and you're sure the HTML is well-formed, you can load it into an XmlDocument or XDocument object and run an XPath query on it.
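
A rough sketch of the XDocument route, assuming the markup is well-formed XML with a single root element and no default namespace (an XHTML xmlns declaration would need a namespace-aware query). XmlDivFinder and FindDivById are illustrative names, and the id is assumed not to contain a single quote:

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

static class XmlDivFinder
{
    // Hypothetical helper: returns the first div element with the given id,
    // or null if there is none. Throws XmlException if the input is not
    // well-formed XML.
    public static XElement FindDivById(string xhtml, string id)
    {
        XDocument doc = XDocument.Parse(xhtml);
        // XPathSelectElement is an extension method from System.Xml.XPath.
        return doc.XPathSelectElement("//div[@id='" + id + "']");
    }
}

// Usage: XElement content = XmlDivFinder.FindDivById(html, "content");
//        content.ToString() then gives the element with its nested divs.
```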

Joel Coehoorn
...and it would probably parse faster than that regular expression.
Bill the Lizard
+2  A: 

Cybis speaks the truth. This sort of problem falls into the context-free languages, which are more powerful than the regular languages (the kind of thing covered by regular expressions). There's a lot of computer science theory involved, but suffice it to say that any language worth its salt will have a library written for this sort of thing, and you should probably be using it.

Dan Fego
+4  A: 

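In .NET you can do this:

(?<text>
  <div\s+id=(\"|&quot;|&\#34;)content(\"|&quot;|&\#34;)[^>]*>
  (?>
      <div\b[^>]*> (?<depth>)
    |
      </div> (?<-depth>)
    |
      (?! <div\b | </div> ) .
  )*
  (?(depth)(?!))
  </div>
)

You must use the Singleline and IgnorePatternWhitespace options. Each nested opening div pushes a capture onto the depth group, each nested closing div pops one, and (?(depth)(?!)) fails the match if a nested div is still open. Here is an example using the console:

using System;
using System.Text.RegularExpressions;

namespace Temp
{
    class Program
    {
        static void Main()
        {
            string s = @"
<div id=""firstdiv"">begining content<div id=""content"">some other stuff
  <div id=""otherdiv"">other stuff here</div>
  more stuff
  </div>
</div>";
            // Comments inside the pattern are allowed by IgnorePatternWhitespace.
            Regex r = new Regex(@"
                (?<text>
                  <div\s+id=(\""|&quot;|&\#34;)content(\""|&quot;|&\#34;)[^>]*>
                  (?>
                      <div\b[^>]*> (?<depth>)   # nested opening div: push
                    |
                      </div> (?<-depth>)        # nested closing div: pop
                    |
                      (?! <div\b | </div> ) .   # any other character
                  )*
                  (?(depth)(?!))                # fail if a nested div is still open
                  </div>
                )",
                RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
            Console.WriteLine("HTML:\n");
            Console.WriteLine(s);
            Match m = r.Match(s);
            if (m.Success)
            {
                Console.WriteLine("\nCaptured text:\n");
                // Refer to the group by name rather than by number.
                Console.WriteLine(m.Groups["text"].Value);
            }
            Console.ReadLine();
        }
    }
}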
pro3carp3
Leave it to Microsoft to change the definition of regular languages.
Cybis